The aim of this research was to develop one solution to the speech inverse filtering
problem  and  to  develop  a flexible,  high  quality  articulatory  speech  synthesis  tool.  The 
results  of  this  study  will  be  of  interest  to  researchers  in  speech  modeling,  analysis,  and 
synthesis.  A software  program  called  ARTM  was  implemented  as  an  articulatory 
synthesis  tool.  One  feature  of  this  research  tool  is  the  simulated  annealing  optimization 
procedure  that  is  used  to  optimize  the  vocal  tract  parameters  to  match  a specified  set  of 
formant  characteristics.  Another  aspect  of  this  study  is  the  derivation  of  a new  form  of  the 
acoustic  equations  that  include  the  subglottal  system,  the  glottal  impedance,  the 
turbulence  noise  source,  and  the  nasal  tract  with  sinus  cavities  for  the  articulatory 
synthesizer. 

A flexible  articulatory  model  was  designed  with  special  interfaces  that  provide  for 
numerical  specification  of  parameters  as  well  as  sliding  bar  capabilities  that  allow 
parameter  adjustments.  A transmission-line  circuit  model  of  the  vocal  system,  which 
includes  the  vocal  tract,  the  nasal  tract  with  sinus  cavities,  the  glottal  impedance,  the 
subglottal  tract,  the  excitation  source,  and  the  turbulence  noise  source,  was  constructed. 

The  acoustic  equations  of  the  vocal  system  were  rederived  for  the  proposed  articulatory 
synthesizer.  A digital  time-domain  approach  was  used  to  simulate  the  dynamic  properties 
of  the  vocal  system  as  well  as  to  improve  the  quality  of  the  synthesized  speech. 

A new, efficient analysis scheme that identifies the articulatory parameters from the
acoustic speech waveform was developed. The scheme uses simulated annealing,
which is constrained to avoid non-unique solutions and local minima problems.
The  constraints  were  determined  by  the  articulatory-to-acoustic  transformation  function 
and  the  boundary  conditions  for  the  articulatory  parameters.  The  cost  function  was 
defined  as  a percentage  of  the  weighted  least-absolute-value  error  distance  between  the 
first  four  formant  frequencies  of  the  articulatory  model  and  the  first  four  formant 
frequencies  determined  from  speech  analysis.  A 1%  error  criterion  was  found  to  be  both 
practical  and  achievable. 

CHAPTER 1
INTRODUCTION 


Speech is perhaps the most distinctive capability of the human species. Speech is our
everyday  communication  medium.  Thus,  it  is  natural  that  engineers  and  speech  scientists 
have  an  interest  in  analyzing,  recognizing,  and  synthesizing  speech.  Basically,  there  are 
three  areas  of  speech  science  research.  They  are  speech  acoustics,  speech  perception,  and 
speech  physiology.  One  of  the  most  important  aspects  of  instrumentation  for  speech 
science  is  speech  synthesis,  which  can  serve  as  a model  of  speech  production  and  provide 
a mechanism  for  the  mechanical  production  of  speech.  An  articulatory  speech  synthesizer 
is  a marriage  of  acoustic  and  physiological  techniques  as  well  as  a model  of  the  human 
articulatory  system.  The  aim  of  this  dissertation  is  to  construct  an  articulatory  speech 
synthesis  software  system  for  the  study  of  speech  acoustics  and  speech  physiology. 

Understanding  the  human  speech  production  process  is  important  not  only  in 
speech  synthesis  but  also  in  automatic  speech  recognition  and  in  the  digital  coding  of 
speech.  We  first  introduce  the  mechanisms  of  speech  production,  followed  by  an 
overview  of  some  existing  speech  synthesis  models.  We  then  outline  the  goals  and  the 
plans  of  this  research,  and  describe  the  content  of  other  chapters. 

1.1  The  Mechanisms  of  Speech  Production 

When  developing  speech  synthesis  for  its  many  possible  applications,  such  as  a 
broad  range  of  telecommunications  applications,  aids  for  the  handicapped,  and  the 
diagnosis  of  articulation  deficiencies,  it  is  helpful  to  have  an  understanding  of  the 
mechanisms  of  speech  production  so  that  these  processes  can  be  modeled. 

Figure  1-1  is  a midsagittal  section  of  a portion  of  the  human  body,  showing  the 
appropriate organs for speech production, which include the lungs, larynx, pharynx, nose,
and various parts of the mouth. The vocal tract begins at the larynx and extends through
the  pharynx  and  mouth  to  the  lips.  The  cross-sectional  area  of  the  vocal  tract,  determined 
by  the  positions  of  the  articulators  including  the  tongue,  lips,  jaw,  and  velum,  varies  from 
zero to approximately 20 cm². The nasal tract begins at the velum and ends at the nostrils.
The  degree  of  closure  of  the  velopharyngeal  port,  so  named  because  the  entrance  lies 
between  the  velum  and  the  walls  of  the  pharynx,  controls  the  coupling  between  the  vocal 
tract  and  the  nasal  tract  for  producing  certain  sounds. 

1.1.1  Speech-sound  Sources 

Speech-sound  sources  are  associated  with  the  partial  conversion  of  the  flow  and 
air  pressure  of  the  air  stream  emanating  from  the  lungs  into  acoustic  energy.  Air  flow  acts 
as  the  acoustic  sound  source  in  speech.  There  are  three  types  of  acoustic  sources  for 
speech:  (1)  the  quasi-periodic  modulation  of  the  air  stream  by  the  vibrating  vocal  folds 
during  phonation,  (2)  the  generation  of  turbulence  at  a constriction  or  obstruction  in  the 
vocal  tract,  and  (3)  the  sudden  release  of  pressure  built  up  against  a closure  in  the  vocal 
tract.  These  correspond,  respectively,  to  a quasi-periodic  voice  source,  a quasi-random 
turbulence  noise  or  friction  source,  and  a transient  source.  These  various  sound  sources 
are  critical  for  speech  production. 

1.1.2  Acoustic  Modulation 

The  acoustic  energy  generated  by  a sound  source  for  speech  production  is 
modified  before  it  is  radiated  as  a speech  waveform.  The  selective  transmission 
characteristics  of  the  cavities  preceding  and  following  the  acoustic  source,  together  with 
the  sound  radiation  characteristics,  influence  the  character  of  the  acoustic  speech  wave. 
This  effect  is  similar  to  the  resonance  effect  observed  with  organ  pipes  or  wind 
instruments.  Thus,  the  vocal  tract  may  be  considered  acoustically  similar  to  a tube.  The 
primary  resonant  modes  of  the  vocal  tract  are  known  as  formants.  A formant  is  specified 
by  its  frequency,  bandwidth,  and  amplitude.  The  variation  of  the  cross-sectional  area  of 
the vocal tract along its length is the most significant determinant of the formant
frequencies.  Different  sounds  are  formed  by  varying  the  shape  of  the  vocal  tract. 
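
As a rough illustration of the tube analogy, the resonances of the simplest possible case can be computed directly. The sketch below assumes a lossless uniform tube that is closed at the glottis and open at the lips, a textbook idealization that is not the model developed in this study. The tube length (17.5 cm) and the speed of sound (35,000 cm/s) are assumed values, and the function name is hypothetical. Under these assumptions the first four resonances fall at 500, 1500, 2500, and 3500 Hz.

```python
def uniform_tube_formants(length_cm, n_formants=4, sound_speed_cm_s=35000.0):
    """Resonances (Hz) of a lossless uniform tube, closed at one end and open at the other.

    Quarter-wavelength rule: F_n = (2n - 1) * c / (4 * L).
    Both the tube length and the speed of sound are assumed inputs.
    """
    return [(2 * n - 1) * sound_speed_cm_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


if __name__ == "__main__":
    for index, freq in enumerate(uniform_tube_formants(17.5), start=1):
        print(f"F{index} = {freq:.0f} Hz")
```

Shortening the tube raises every resonance, and a nonuniform area function moves the resonances away from this odd-multiple pattern, which is why the area function determines the formant frequencies.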

1.2  Speech  Synthesis  Models 

The  fundamental  principles  of  sound  generation  in  the  vocal  tract  and  the  acoustic 
filtering behavior of the tract are well known, although some of the time-varying and
nonlinear  characteristics  of  vocal  fold  vibration  and  source-tract  interaction  remain  to  be 
studied  and  quantified. 

The  synthesis  of  speech  has  been  studied  in  great  detail,  both  because  of  its  broad 
applications  and  because  of  its  contribution  to  a better  understanding  of  speech.  Speech 
synthesis  provides  a model  for  evaluating  the  significance  of  speech  parameters  obtained 
by  acoustic  analysis.  Speech  synthesis  can  be  useful  in  the  study  of  phonetics,  since  each 
acoustic  parameter  can  be  controlled  independently  and  arbitrarily.  There  are  essentially 
three  different  methods  used  for  speech  synthesis:  formant  synthesis,  synthesis  by  linear 
prediction  (LP),  and  articulatory  synthesis. 

1.2.1  Formant  Synthesis 

Formant  synthesis  is  based  on  the  source-filter  model  of  speech  production  (Fant, 
1960).  The  tract  acts  as  a filter  with  various  resonances,  or  formants,  to  shape  the  spectral 
characteristics  of  speech.  In  general  terms,  the  source  signal  is  thought  of  as  being  either 
periodic  for  the  voiced  sounds,  or  noise-like  for  unvoiced  sounds.  The  formant 
frequencies,  amplitudes,  and  bandwidths  can  be  implemented  in  the  form  of  a digital 
filter.  The  excitation  source  and  the  spectral  shaping  network  (filter)  that  make  up  a 
formant  synthesizer  must  be  varied  dynamically  to  mimic  the  changes  that  occur  in  the 
source  characteristics  and  the  vocal  tract  shape  during  speech  production.  Since  these 
changes  occur  relatively  slowly,  it  is  possible  to  use  a set  of  synthesizer  parameters  (as 
control  signals  for  the  formant  synthesizer)  to  specify  a short  segment  (frame)  of  the 
speech  signal.  These  parameters  can  be  used  to  reduce  the  amount  of  data  needed  to 
represent the speech signal so that the data rate is approximately 3 kbits/s. Formant
synthesis  offers  the  possibility  of  synthesizing  a new  utterance  from  theoretical 
parameters. 
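
To make the idea of implementing a formant as a digital filter concrete, the following sketch shows one common second-order resonator of the kind used in Klatt-style formant synthesizers. The difference equation is y[n] = A x[n] + B y[n-1] + C y[n-2], with the coefficients computed from the formant frequency and bandwidth. The function names and the example values (500 Hz, 60 Hz bandwidth, 10 kHz sampling rate) are assumptions chosen for illustration only.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients (A, B, C) of a two-pole digital resonator with unity gain at DC."""
    period = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * period)
         * math.cos(2.0 * math.pi * freq_hz * period))
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through the resonator, starting from zero state."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # previous two output samples
    output = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        output.append(y)
        y2, y1 = y1, y
    return output


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    print(resonate(impulse, 500.0, 60.0, 10000.0))
```

A cascade synthesizer would pass the excitation through several such resonators in series, one per formant, updating the coefficients once per frame.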

At  present,  there  are  two  commonly  used  formant  synthesis  models,  the 
cascade/parallel  formant  synthesizer,  Figure  1-2,  developed  by  Klatt  (1980,  1990)  and  the 
versatile  parallel  formant  synthesizer,  Figure  1-3,  developed  by  Rye  and  Holmes  (1982). 
Although  there  has  been  dissent  about  which  of  the  two  systems  is  the  better  (Holmes, 
1983), it is generally agreed that the Klatt model has been favoured for text-to-speech
synthesis,  while  the  Holmes  model  tends  to  be  used  for  synthesis-by-analysis  systems. 
The  reasons  for  this  are  probably  related  more  to  the  way  in  which  the  different  synthesis 
models  are  controlled,  rather  than  the  inherent  capabilities  of  the  synthesizers  themselves. 
It  has  been  demonstrated  that  high-quality  speech  can  be  generated  with  such  synthesizers 
(Klatt,  1980;  Klatt  and  Klatt,  1990).  However,  the  control  tables  are  complicated. 

Clearly,  such  a model  has  no  simple  relationship  to  an  articulatory  specification  of 
the  vocal  tract.  Although  it  cannot  properly  represent  the  effects  of  varying  glottal 
impedance  and  subglottal  coupling,  the  subtleties  of  vocal  fold  motion,  etc.,  many 
successful  speech-synthesis  systems  are  based  on  formant  synthesis  since  it  is  possible  to 
make  a functional  approximation  to  these  effects. 

1.2.2 Linear Prediction (LP) Synthesis

In  1971  a new  technique  was  developed  for  analyzing  and  synthesizing  speech 
using  computers.  The  method,  known  as  linear  prediction  (Atal  and  Hanauer,  1971),  has 
been  widely  accepted.  With  linear  prediction,  a linear  polynomial  is  used  to  predict  the 
subsequent  values  of  the  speech  waveform  from  previous  values  of  the  speech.  The  LP 
model  assumes  that  the  speech  was  generated  by  an  impulse  or  white  noise  excitation  (see 
Figure  1-4).  The  synthetic  speech  produced  by  this  model  is  usually  intelligible,  but  often 
exhibits unnatural characteristics. To improve the quality, the error signal (residue
between the predicted signal and the original speech) can be used as the excitation.
However, this requires as much information as the original waveform, so no economy in
bit rate is achieved. The speech generated by LP synthesizers often sounds “buzzy,” and
because such a model is an all-pole model, nasal and obstruent sounds are difficult to
reproduce (Childers and Wu, 1990). In recent years multipulse excitation has shown
promise for improving the quality of LP synthetic speech (Atal and Remde, 1985; Singhal
and Atal, 1989).

Figure 1-3: Holmes’ (1982) versatile parallel formant synthesizer.

Figure 1-4: Basic structure of LP synthesis (inputs: pitch and coefficients; output: synthetic speech).

Figure 1-5: Basic structure of articulatory synthesis.
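
The prediction step of Section 1.2.2 can be sketched in a few lines. The example below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion, one standard way of solving the LP normal equations; the source does not state which solution method was used, so this choice and all function names are assumptions. The test signal is a decaying exponential, for which a first-order predictor is essentially exact.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lp_coefficients(signal, order):
    """Levinson-Durbin recursion.

    Returns (a, error) where s[n] is predicted as sum(a[k-1] * s[n-k], k = 1..order)
    and error is the remaining prediction-error energy.
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:], error


if __name__ == "__main__":
    decaying = [0.9 ** n for n in range(200)]
    coeffs, residual_energy = lp_coefficients(decaying, 2)
    print(coeffs, residual_energy)
```

The prediction-error energy returned here corresponds to the energy of the error signal (residue) discussed above; an LP synthesizer replaces that signal with an impulse train or white noise.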

1.2.3  Articulatory  Synthesis 

Both  formant  and  LP  synthesis  methods  are  acoustic-domain  models  of  speech, 
and  there  is  no  interaction  between  the  glottal-flow  excitation  function  and  the  vocal-tract 
filter  function. 

Articulatory  synthesis  is  the  production  of  speech  sounds  using  a model  of  the 
vocal  tract,  which  directly  or  indirectly  simulates  the  movements  of  the  speech 
articulators.  It  provides  a means  for  gaining  an  understanding  of  speech  production  and 
for  studying  phonetics.  In  such  a model  coarticulation  effects  arise  naturally,  and  in 
principle  it  should  be  possible  to  deal  correctly  with  glottal  source  properties,  interaction 
between  the  vocal  tract  and  the  vocal  folds,  the  contribution  of  the  subglottal  system,  and 
the  effects  of  the  nasal  tract  and  sinus  cavities. 

Articulatory  synthesis  usually  consists  of  two  separate  components  as  shown  in 
Figure  1-5.  In  the  articulatory  model,  the  vocal  tract  is  divided  into  many  small  sections 
and  the  corresponding  cross-sectional  areas  are  used  as  parameters  to  represent  the  vocal 
tract  characteristics.  In  the  acoustic  model,  each  cross-sectional  area  is  approximated  by 
an  electrical  analog  transmission  line.  To  simulate  the  movement  of  the  vocal  tract,  the 
area  functions  must  change  with  time.  Each  sound  is  designated  in  terms  of  a target 
configuration  and  the  movement  of  the  vocal  tract  is  specified  by  a separate  fast  or  slow 
motion  of  the  articulators. 
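
As a simple illustration of the electrical analogy, each short tube section can be mapped to a lumped series inertance and shunt compliance of a lossless transmission line. The sketch below uses the standard lossless relations L = ρl/A and C = Al/(ρc²); the losses, wall effects, and source models treated in this study are omitted. The constants and function names are assumptions chosen for illustration.

```python
import math

AIR_DENSITY = 1.14e-3  # g/cm^3, assumed value for warm, moist air
SOUND_SPEED = 3.5e4    # cm/s, assumed value


def section_elements(area_cm2, length_cm, rho=AIR_DENSITY, c=SOUND_SPEED):
    """Lossless lumped elements for one tube section.

    Returns (inertance, compliance), the acoustic analogs of series
    inductance and shunt capacitance. The area must be positive; a fully
    closed section would need special handling that is not shown here.
    """
    inertance = rho * length_cm / area_cm2
    compliance = area_cm2 * length_cm / (rho * c * c)
    return inertance, compliance


def area_function_to_line(areas_cm2, section_length_cm):
    """Map an area function (one area per section) to a list of line elements."""
    return [section_elements(area, section_length_cm) for area in areas_cm2]


if __name__ == "__main__":
    line = area_function_to_line([2.0, 3.5, 5.0, 1.0], 0.5)
    for inertance, compliance in line:
        # Characteristic impedance of the section equals rho * c / A.
        print(math.sqrt(inertance / compliance))
```

Changing the area function over time changes every element of the line, which is how articulator movement enters the acoustic simulation.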
At the present time the complexity of articulatory synthesis is partially due to the
analysis  procedure,  which  usually  requires  an  “articulatory-to-acoustic  inverse 
transformation”  from  the  speech  signal,  i.e.,  speech  inverse  filtering.  The  complexity  of 
the  relationship  between  articulatory  gestures  and  the  acoustic  signal  makes  it  very 
difficult  to  generate  automatically  the  details  of  articulatory  control  needed  to  produce  a 
synthetic  copy  of  a given  sample  of  human  speech.  Despite  such  drawbacks,  articulatory 
speech  synthesis  has  several  advantages: 

(1) The model has a direct relation to the human speech production process.
Consequently,  it  is  conjectured  that  articulatory  synthesis  may  lead  to  a simpler 
and  more  elegant  synthesis  by  rule,  e.g.,  text-to-speech  applications 
(Parthasarathy  and  Coker,  1990,  1992)  and  articulation-based  speech 
recognition  systems  (Erler  and  Deng,  1993). 

(2) The articulatory parameters in the human voice production system vary
slowly.  Consequently,  researchers  have  suggested  that  these  parameters  are 
potential  candidates  for  efficient  coding,  e.g.,  low  bit-rate  speech 
communication  (Flanagan  et  al.,  1980). 

(3) To the extent that we can accurately obtain the speech gestures (articulatory
movements  or  trajectories),  articulatory  synthesizers  may  be  valuable  for 
research  scientists  and  physicians,  since  the  synthesizers  can  be  used  to  study 
linguistic  theories,  to  provide  a feedback  mechanism  for  teaching  speech 
production,  and  to  explore  the  effects  of  vocal  tract  surgical  techniques  on 
speech  production  prior  to  surgical  intervention  (Childers,  1991);  and  they 
hold  out  the  ultimate  promise  of  high  quality,  natural-sounding  speech  with  a 
simple  control  scheme  (Klatt,  1987). 

A properly  constructed  articulatory  synthesizer  is  capable  of  reproducing  all  the 
naturally  relevant  effects  for  the  generation  of  fricatives  and  plosives,  modeling 
coarticulation  transitions  as  well  as  source-tract  interaction  in  a manner  that  resembles  the 
physical process that occurs in real speech production. Articulatory synthesizers will
continue  to  be  of  great  importance  for  research  purposes,  and  to  provide  insights  into 
various  acoustic  features  of  human  speech.  Thus,  an  articulatory  synthesizer  may  provide 
both  an  efficient  description  of  natural  speech  and  a means  for  synthesizing 
natural-sounding  speech.  However,  a major  problem  with  the  articulatory  synthesizer  is 
the  lack  of  a means  to  derive  articulatory  configurations  from  the  speech  signal  using 
speech  inverse  filtering.  This  study  addresses  this  issue. 

1.3  Research  Goals  and  Methodology 

1.3.1  Research  Goals 

A.  To  build  a flexible,  user-friendly,  and  high  quality  articulatory  synthesis  tool. 

B.  To  develop  a comprehensive  acoustic  model  that  includes  the  subglottal  system, 
glottal  impedance,  vocal  tract,  nasal  tract  with  sinus  cavities,  and  acoustic  radiation  for  the 
articulatory  synthesizer. 

C.  To  develop  one  solution  for  the  speech  inverse  filtering  problem. 

1.3.2  Research  Methodology 

A.  Establish  a flexible  and  user-friendly  environment  for  the  articulatory 
synthesizer  with  the  following  display  features: 

1. the synthetic speech time waveform,

2.  the  articulator  movements  (gestures)  for  synthesizing  words  or  sentences, 

3.  the  cross-sectional  area  and  acoustic  transfer  function  of  the  vocal  tract, 

4.  the  pressure  and  volume-velocity  waveforms  at  selected  points  in  the  vocal 
tract,  and 

5.  the  excitation  source  waveform  and  power  spectral  density. 

B.  Implement  a new  iterative  optimization  procedure  for  the  articulatory  model. 
The  optimization  procedure  is  based  on  the  simulated  annealing  (SA)  algorithm.  Using 
this method, the articulatory parameters are optimized to minimize the error distance
between  the  natural  and  the  model-generated  first  four  formants. 

C.  Develop  a new  excitation  source  model  for  the  articulatory  synthesizer  by 
combining  the  Lalwani-Childers  glottal  source  (a  modified  LF-model),  subglottal,  and 
glottal  area  models. 

D.  Modify  and  derive  the  vocal  tract  and  the  nasal  tract  models  to  be  able  to 
calculate  the  acoustic  transfer  function  from  selected  points  in  the  vocal  tract,  to  insert  the 
noise source model at the center or the outlet (downstream) of the constriction, or
distributed  along  a specified  spatial  interval,  and  to  simulate  the  viscous,  heat  conduction, 
and  yielding  wall  losses  in  the  vocal  and  nasal  tracts.  This  new  model  also  includes  the 
effects  of  the  sinus  cavities. 

E.  Synthesize  speech  tokens  by  varying  the  following  parameters: 

1.  the  number  of  vocal  tract  sections, 

2.  synthesis  sampling  frequency, 

3.  velopharyngeal  port  opening,  and 

4.  excitation  source  parameters. 

1.4  Description  of  Chapters 

Chapter  2 describes  the  implementation  of  the  articulatory  model  and  the 
development of the acoustic model. A modified version of Mermelstein’s articulatory model is
designed using XView, devguide, and C functions. An acoustic model of speech
production  that  includes  the  subglottal  system,  glottal  impedance,  vocal  tract,  nasal  tract 
with  sinus  cavities,  and  acoustic  radiation  is  realized  to  generate  speech. 

Chapter  3,  speech  inverse  filtering,  describes  how  the  articulatory  model  derives 
the  vocal-tract  configuration  from  acoustic  features.  The  formant  frequencies,  extracted 
from  real  speech,  are  specified  as  acoustic  features  and  a simulated  annealing  algorithm  is 
used to determine the articulatory parameters that minimize the error distance between the
specified  and  the  model-generated  first  four  formant  frequencies. 
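
The optimization idea can be sketched with a deliberately small example. The code below is not the procedure developed in Chapter 3. It replaces the articulatory model with a hypothetical one-parameter stand-in (a uniform tube whose only parameter is its length), uses equal formant weights, and uses a geometric cooling schedule; all of these, and every name in the code, are assumptions. It does follow the description above in using a percentage weighted absolute error over the first four formants and in stopping once the error falls below 1%.

```python
import math
import random

SOUND_SPEED = 35000.0  # cm/s, assumed value


def toy_formants(length_cm, count=4):
    """Hypothetical stand-in for the articulatory-to-acoustic transformation."""
    return [(2 * k - 1) * SOUND_SPEED / (4.0 * length_cm) for k in range(1, count + 1)]


def formant_cost(model, target, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted absolute formant error in percent of the target (weights assumed)."""
    return 100.0 * sum(w * abs(m - t) / t for w, m, t in zip(weights, model, target))


def anneal(target, lower=10.0, upper=25.0, start=12.0, seed=0,
           temperature=10.0, cooling=0.995, max_iter=5000, tolerance=1.0):
    """Simulated annealing over one bounded parameter; returns (best, best_cost)."""
    rng = random.Random(seed)
    current = best = start
    current_cost = best_cost = formant_cost(toy_formants(current), target)
    for _ in range(max_iter):
        if best_cost < tolerance:  # 1% error criterion
            break
        # Propose a neighbor and clamp it to the parameter bounds.
        candidate = min(upper, max(lower, current + rng.gauss(0.0, 0.5)))
        cost = formant_cost(toy_formants(candidate), target)
        delta = cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    target_formants = toy_formants(17.5)
    print(anneal(target_formants))
```

In the real problem the parameter vector has several articulatory dimensions, and the bounds play the role of the boundary conditions on the articulatory parameters.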

A flexible,  user-friendly,  high  quality  articulatory  synthesis  tool  is  implemented. 
Chapter  4 describes  the  software  features  and  the  design  concepts.  We  also  illustrate 
operation  of  the  software  system. 

Chapter  5 presents  several  synthesized  speech  examples  using  the  articulatory 
synthesis  tool.  Several  experiments  illustrate  the  effects  of  changing  various  parameters, 
thereby  modeling  features  of  the  speech  production  process  and  providing  a mechanism  to 
evaluate  the  effects  of  software  implementation  factors. 

The  final  chapter  draws  conclusions  and  describes  research  extensions.  We  also 
attach  several  appendices  as  supplements  to  this  research.  The  appendices  include 
articulatory  and  acoustic  characteristics  of  typical  American  vowels,  acoustic  transfer 
function  calculations,  acoustic  equation  derivations,  and  an  outline  of  the  optimization 
procedure. 


CHAPTER  2 

ARTICULATORY  SYNTHESIZER  MODEL 


The  articulatory  synthesizer  is  based  on  a model  of  the  physiology  of  the  human 
speech production process. As shown in Figure 1-5, the articulatory synthesizer has two
components.  The  articulatory  model  represents  the  articulatory  positions  and  converts 
them  into  vocal  tract  cross-sectional  area  functions.  The  acoustic  model,  which  includes 
subglottal  coupling,  source-tract  interaction,  vocal  tract,  nasal  tract  with  sinus  cavities, 
and  acoustic  radiation,  simulates  the  speech  sound  propagation  through  the  vocal  system 
as  well  as  the  physics  of  the  physiological-to-acoustic  transformation.  This  chapter 
presents  the  implementation  of  the  articulatory  model  and  the  realization  of  the  acoustic 
model. Our articulatory model is based on Mermelstein’s model (1973) and is
implemented using the XView and devguide graphical user interface tools and C functions. The
time-domain  approach  is  used  to  implement  the  acoustic  model,  since  it  offers  the  ability 
to  simulate  the  dynamic  properties  of  the  vocal  system  as  well  as  a method  to  improve  the 
quality  of  the  synthesized  speech.  Methods  for  estimating  articulatory  data  from  acoustic 
measurements  are  reviewed  and  described  in  Chapter  3. 

2.1  Review  of  Articulatory  Models 

According  to  the  acoustic  theory  of  speech  production,  the  human  vocal  tract  can 
be  modeled  as  an  acoustic  tube  with  nonuniform  and  time-varying  cross-sections.  It 
modulates  the  excitation  source  to  produce  various  linguistic  sounds.  The  acoustic  tube 
can  be  adjusted  into  various  shapes  by  moving  articulatory  parameters.  These  articulatory 
parameters  specify  the  positions  of  the  tongue  body,  tongue  tip,  jaw,  lips,  hyoid,  and 
velum.  Articulatory  models  are  well  known  in  the  literature  and  can  be  classified  into  two 
major  types:  parametric  area  models  and  midsagittal  distance  models. 

2.1.1 Parametric Area Models

Strictly  speaking,  the  parametric  area  models  do  not  represent  articulatory 
positions  directly,  but  rather  concentrate  on  modeling  the  area  function  as  a function  of 
distance  along  the  tract  subject  to  some  constraints  (Stevens  and  House,  1955;  Fant,  1960; 
Atal  et  al.,  1978;  Flanagan  et  al.,  1980;  Lin,  1990;  Yu,  1993).  Their  common  feature  is  a 
specification  of  the  minimum  constriction  area  Ac  and  its  axial  location  Xc.  The  area  of 
the  vocal  tract  is  usually  represented  by  a continuous  function  such  as  a hyperbola,  a 
parabola,  or  a sinusoid  (Lin,  1990).  Consonant  articulations  have  generally  not  been 
implemented.  Figure  2-1  shows  one  example  of  parametric  area  models. 

2.1.2  Midsagittal  Distance  Models 

The  midsagittal  distance  models  are  usually  based  on  a representation  of  the 
midsagittal  plane  as  seen  from  an  X-ray  image.  They  describe  the  speech  organ 
movements  in  a midsagittal  plane  and  require  an  input  to  specify  the  positions  of  the 
articulators  (Mermelstein,  1973;  Levinson  and  Schmidt,  1983;  Sondhi  and  Schroeter, 
1986;  Prado,  1991)  or  to  control  the  movements  of  the  articulators  by  rules  (Coker,  1976; 
Parthasarathy  and  Coker,  1990,  1992).  The  output  is  an  estimate  of  the  vocal  tract 
cross-sectional  area.  Visualization  and  articulatory  state  interpretation  are  the  major 
advantages  of  these  models.  Figure  2-2  shows  one  example  of  midsagittal  distance 
models. 


2.2  Implementation  of  the  Articulatory  Model 

Articulatory  models  are  used  to  transform  articulatory  parameters  to  a vector 
representation  of  the  vocal  tract  cross-sectional  area,  and  from  there,  to  acoustic 
characteristics of the vocal tract. Our articulatory model is a modified version of
Mermelstein’s model (1973). By using XView, devguide, and C functions, the model
has been designed with special interfaces that provide for numerical specification of
parameters as well as sliding bar capabilities that allow parameter adjustments.

Figure 2-1: An example of parametric area models.

Figure 2-2: An example of midsagittal distance models.

Mermelstein’s  model  (1973)  generally  achieves  a match  between  X-ray  tracings 
and  a midsagittal  vocal  tract  outline,  but  there  is  not  enough  information  for  a robust 
representation  of  the  lower  part  of  the  pharynx  and  for  the  region  between  the  tongue  tip 
and  the  jaw.  Our  approach  modifies  the  lower  part  of  the  pharynx  and  optimizes  this 
region  whenever  necessary  (see  Chapter  3).  We  have  also  modified  the  hyoid  and 
tongue-tip-to-jaw  regions. 

2.2.1 Articulatory Parameters and Midsagittal Vocal Tract Outline

In  the  articulatory  model,  a set  of  variables  is  used  to  specify  the  inferior  outline  of 
the vocal tract (Figure 2-3). These variables, called articulatory parameters, are:

Tongue body center: The tongue body is represented by an arc (DL-B) of a circle with a
moving  center  and  fixed  radius.  The  tongue  body  center,  denoted  as  tongc,  has  polar 
coordinates  (sc,  thetaj+thetab)  with  respect  to  the  fixed  point  F.  However,  the  rectangular 
coordinates  (tbodyx,  tbodyy)  are  used  for  display  and  optimization. 

Tongue tip: The tongue tip is represented by the rectangular coordinates (tipx,
tipy)  of  point  T.  Arcs  B-T  and  T-PF,  then,  specify  the  tongue  blade  outline.  Since  the 
location  of  point  B varies  with  the  tongue-body  center  (tongc)  and  the  jaw  angle  (jaw), 
the  tongue  blade  movements  depend  on  the  tongue  body  and  jaw  positions. 

Jaw: The point JAW, with polar coordinates (sj, thetaj), is used to represent the
jaw location. The distance sj is kept constant for most phonemes. The parameter jaw is
used to denote the angle thetaj. Note that the jaw concavity is approximated by a polyline,
a connected sequence of line segments (PF-PS-JAW-L6).

Lips: The lips are represented by points L5 (upper) and L7 (lower). With respect
to the point JAW, the coordinates of the lower lip are represented by (lipp, lipo),
which specify the lip protrusion and lip opening, respectively. The use of lipp and
lipo as separate variables allows lip closure, lip separation, or rounded lips. The upper
lip L5 has the same coordinate values with respect to point U.

Figure 2-3: Articulatory model parameters.

Hyoid: The hyoid is specified by the parameter hyoid, the distance from point
PP to the line segment H-DL. The point PP is on the normal bisector of the line segment
H-DL, which is tangent to the tongue body arc outline at point DL. The line segment
DL-PP and arc PP-H as well as the tongue body determine the anterior shape of the
pharynx. The point H represents the intersection of the anterior edge of the epiglottis with
the  top  edge  of  the  hyoid  bone.  The  point  K represents  an  estimate  of  the  anterior 
extremity  of  the  larynx. 

The  superior  outline  of  the  vocal  tract  is  represented  by  the  upper  teeth  position  U, 
the  hard  palate  curve  U-N-M,  the  highest  point  on  the  maxilla  M,  the  soft  palate  arc 
M-V,  the  velum  position  V,  the  back  wall  of  pharynx  position  W,  and  the  highest  point  of 
the  periarytenoid  G.  In  the  hard  palate  curve,  the  point  N is  located  on  the  line  segment 
M-U  such  that  the  distance  M-N  is  twice  the  distance  N-U.  Circular  arcs  M-V  and  M-N 
are  drawn  with  centers  on  a vertical  line  through  M.  The  posterior-superior  outline  is 
generally  considered  fixed  except  for  the  soft  palate  curve  near  the  velum  point  V.  To 
specify  the  opening  area  of  the  velopharyngeal  port,  we  treat  the  velum  as  an  articulatory 
parameter. 

Velum: The state of the velum is represented by the position V of the tip of the
uvula  moving  along  a line  segment  (V-V’).  The  velar  opening  area  is  assumed 
proportional  to  the  distance  between  the  point  V and  the  most  elevated  point  of  the  velum. 
This  distance  is  specified  by  the  variable  velum. 

2.2.2  Determination  of  the  Vocal  Tract  Section  Lengths  and  Cross-sectional  Areas 

The  vocal  tract  cross-sectional  area  function  is  determined  by  the  areas  of  the 
sections,  whose  projections  on  the  X-Y  plane  form  the  sagittal  grids  of  the  vocal  tract,  as 
shown in Figure 2-4. These grid lines vary with the positions of the articulators (they are
fixed in Mermelstein’s model), i.e., the interval between two adjacent parallel sagittal grid
lines (regions AR1 and AR5) or the angle between two adjacent radial sagittal grid lines
(the remaining regions) is not fixed. A total of 60 sections, 59 sections for the vocal tract plus one
section (fixed length and area) for the outlet of the glottis, are used in our model. This
feature provides more reliable estimates of the sagittal distances and cross-sectional areas.

Figure 2-4: Midsagittal grids and different area regions of the articulatory model.

The distance between the midpoints of two consecutive sagittal lines, $s_j$ and $s_{j+1}$, represents the length of section $j$, $sl_j$ (see Figure 2-4). The sagittal distance $g_j$ of section $j$ is defined as the grid line segment length between the posterior-superior and anterior-inferior outlines. The sagittal distances are converted to cross-sectional areas by an empiric function based on previously published data (Mermelstein, 1973). In general, the cross-sectional area function is formulated as

$$A_j = F(j, g_j) \cdot \cos\alpha_j \tag{2.1}$$

where $j = 2$ (vocal tract inlet), ..., 60 (lips end), $F(j, g_j)$ is an empiric function that has a different formula for the pharyngeal region, oral region, and labial region, and $\alpha_j$ is the deviation angle of the direction of wave propagation from the normal to the $j$-th grid line (Mermelstein, 1973; Rubin et al., 1981; Guo and Milenkovic, 1993).

In the pharyngeal region (AR1 and AR9 in Figure 2-4), the empiric function is

$$F(j, g_j) = \pi \cdot g_j \cdot b_j \tag{2.2}$$

where $g_j$ is one axis and $b_j = g_j + \Delta g$, with $\Delta g \in [1.5, 3]$, is the other axis of the ellipse, since we approximate each pharyngeal section as an elliptic cylinder. The axis $b_j$ increases proportionally as the grid line moves upward from the larynx. In the soft palate region (AR2 in Figure 2-4), the empiric function has the form

$$F(j, g_j) = 2.0 \cdot g_j^{1.5} \tag{2.3}$$

In the hard palate region (AR23 in Figure 2-4), the empiric function is given by

$$F(j, g_j) = 1.6 \cdot g_j^{1.5} \tag{2.4}$$

For the labial region (AR5 in Figure 2-4), the empiric function is

$$F(j, g_j) = g_j \cdot [2.0 + 1.5 \cdot (lipo - lipp)] \tag{2.5}$$

For the other region (AR4 in Figure 2-4), the empiric function is

$$F(j, g_j) = \begin{cases} 1.5 \cdot g_j, & \text{for } g_j < 0.5 \\ 0.75 + 3 \cdot (g_j - 0.5), & \text{for } 0.5 < g_j < 2 \\ 5.25 + 5 \cdot (g_j - 2), & \text{for } g_j > 2 \end{cases} \tag{2.6}$$
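The region-dependent conversion from sagittal distance to area in equations (2.1) through (2.6) can be sketched in a few lines of Python. This is an illustrative sketch, not the ARTM implementation: the function name, the region labels, and the default `delta_g` are hypothetical; the exponent 1.5 and the first branch of the piecewise function (1.5 times g for g below 0.5) are taken from Mermelstein (1973) and are assumptions with respect to the text above.

```python
import math


def area_from_sagittal(region, g, cos_alpha=1.0, delta_g=2.0, lipo=0.0, lipp=0.0):
    """Return A_j = F(j, g_j) * cos(alpha_j) for one section (cm^2).

    region    : hypothetical label selecting the empiric formula
    g         : sagittal distance g_j (cm)
    cos_alpha : cosine of the deviation angle alpha_j in equation (2.1)
    delta_g   : offset between the two ellipse axes, assumed in [1.5, 3]
    lipo, lipp: lip variables as named in equation (2.5)
    """
    if region == "pharyngeal":            # equation (2.2)
        f = math.pi * g * (g + delta_g)
    elif region == "soft_palate":         # equation (2.3)
        f = 2.0 * g ** 1.5
    elif region == "hard_palate":         # equation (2.4)
        f = 1.6 * g ** 1.5
    elif region == "labial":              # equation (2.5)
        f = g * (2.0 + 1.5 * (lipo - lipp))
    elif region == "other":               # equation (2.6), piecewise linear
        if g < 0.5:
            f = 1.5 * g                   # assumed first branch (Mermelstein, 1973)
        elif g < 2.0:
            f = 0.75 + 3.0 * (g - 0.5)
        else:
            f = 5.25 + 5.0 * (g - 2.0)
    else:
        raise ValueError("unknown region: " + region)
    return f * cos_alpha
```

With the assumed first branch, the piecewise function is continuous at both breakpoints (0.75 at g = 0.5 and 5.25 at g = 2), which is one reason that branch is a plausible reading.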


2.2.3 Calculation of Formant Frequencies from the Vocal Tract Cross-sectional Areas

The  calculation  of  formant  frequencies  from  a given  vocal  tract  cross-sectional 
area  function  has  been  well  established  in  the  acoustic  theory  of  speech  production  (Fant, 
1960;  Atal  et  al.,  1978;  Wakita  and  Fant,  1978;  Badin  and  Fant,  1984;  Fant,  1985;  Lin, 
1990, 1992). By computing the acoustic transfer function of a given vocal tract configuration, we can extract the formant frequencies from the roots of the denominator of the acoustic transfer function. Refer to Appendix B for detailed acoustic transfer function calculations.

Let an all-pole acoustic transfer function be

$$H(s) = 1/H_p(s) \tag{2.7}$$

where $s = j2\pi f$. The denominator $H_p(s)$ is normally a complex number

$$H_p(s) = N_b(s) + j \cdot N_a(s) \tag{2.8}$$

For a lossless vocal tract $N_a(s)$ is zero. When the losses are small, $N_a(s)$ is small compared with $N_b(s)$. Consequently, the roots of the complex function $H_p(s)$ should be located in the neighborhood of the roots of $N_b(s)$. Based on this assumption a two-step approach was proposed by Fant (1960) and was referred to as the $N_b$ method (Lin, 1990, 1992). Figure 2-5 illustrates the flow chart of the $N_b$ method.

The first step of the $N_b$ method is to search for the roots of $N_b(s) = 0$. At a given frequency $f_n$, the value $N_b(j2\pi f_n)$ is computed. The frequency is next increased (by a few hundred Hertz) and the value $N_b(j2\pi f_{n+1})$ is computed at the new frequency $f_{n+1}$. If the polarity changes within this interval, that is, $N_b(j2\pi f_n) \cdot N_b(j2\pi f_{n+1}) < 0$, there is a root to be detected. Newton's approximation or other methods can be used to determine the root frequency of $N_b(s)$ within this interval. Let $f_0$ be the root frequency estimated by setting $N_b(s) = 0$. The second step is to account for the finite $N_a(s)$ by means of a first-order approximation of $H_p(s)$ in the vicinity of $j2\pi f_0$:

$$H_p(s) \approx H_p(j2\pi f_0) + (s - j2\pi f_0) \cdot H_p'(j2\pi f_0) \tag{2.9}$$

where

$$H_p'(j2\pi f_0) = \left.\frac{d[H_p(s)]}{ds}\right|_{s = j2\pi f_0} = N_a'(j2\pi f_0) - j \cdot N_b'(j2\pi f_0) \tag{2.10}$$

and the primes on $N_a$ and $N_b$ denote derivatives with respect to the radian frequency $\omega = 2\pi f$. Set $H_p(s) = 0$ and let the roots be denoted as

$$s_n = \sigma_n + j \cdot (2\pi f_0 + \Delta\omega_n) \tag{2.11}$$

From the above equations, with $N_a$, $N_a'$, and $N_b'$ evaluated at $j2\pi f_0$, we have

$$\sigma_n = \frac{N_a \cdot N_b'}{N_a'^2 + N_b'^2} \tag{2.12a}$$

$$\Delta\omega_n = -\sigma_n \cdot \frac{N_a'}{N_b'} \tag{2.12b}$$

The final pole frequency is given by $f_n = f_0 + \Delta\omega_n / (2\pi)$. The corresponding pole bandwidth is $B_n = -\sigma_n / \pi$. By repeating the two-step procedure, one can terminate the search when the first four formants have been found or the scanned frequency exceeds 5 kHz.

Figure 2-5: The flow chart of the Nb method for decomposition of formants.


In summary, the $N_b$ method samples $N_b(s)$ at specific frequency increments and checks for changes of polarity. A root-refinement step, such as linear interpolation or Newton's method, is then used to obtain the root frequencies. To determine the final pole frequencies of $H(s)$, the derivatives $N_a'$ and $N_b'$ in equation (2.10) are approximated by finite differences.
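The two-step procedure can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the ARTM code: the function names are hypothetical, bisection stands in for Newton's method in the root refinement, and `toy_hp` is an invented one-resonance denominator used only to exercise the routine. The derivatives are taken with respect to radian frequency by central finite differences.

```python
import math


def nb_method(hp, f_max=5000.0, step=100.0, max_formants=4):
    """Estimate (frequency, bandwidth) pairs in Hz from hp(f) = Hp(j*2*pi*f)."""
    two_pi = 2.0 * math.pi

    def nb(f):
        return hp(f).real          # N_b: real part of the denominator

    def na(f):
        return hp(f).imag          # N_a: imaginary part (loss term)

    def d_domega(func, f, h=1.0):
        # Central finite difference, converted from d/df to d/d(omega).
        return (func(f + h) - func(f - h)) / (2.0 * h * two_pi)

    formants = []
    for i in range(1, int(f_max / step)):
        if len(formants) >= max_formants:
            break
        f_lo, f_hi = i * step, (i + 1) * step
        if nb(f_lo) * nb(f_hi) >= 0.0:
            continue               # no polarity change in this interval
        # Step 1: refine the root of N_b by bisection (stand-in for Newton).
        for _ in range(60):
            f_mid = 0.5 * (f_lo + f_hi)
            if nb(f_lo) * nb(f_mid) <= 0.0:
                f_hi = f_mid
            else:
                f_lo = f_mid
        f0 = 0.5 * (f_lo + f_hi)
        # Step 2: first-order correction for the finite N_a.
        na0, dna, dnb = na(f0), d_domega(na, f0), d_domega(nb, f0)
        sigma = na0 * dnb / (dna ** 2 + dnb ** 2)
        d_omega = -sigma * dna / dnb
        formants.append((f0 + d_omega / two_pi, -sigma / math.pi))
    return formants


def toy_hp(f, f1=500.0, b1=50.0):
    """Hypothetical denominator with one pole pair at f1 Hz, bandwidth b1 Hz."""
    s = 1j * 2.0 * math.pi * f
    s1 = complex(-math.pi * b1, 2.0 * math.pi * f1)
    return (s - s1) * (s - s1.conjugate()) / abs(s1) ** 2
```

Calling `nb_method(toy_hp)` returns a single pair close to the 500 Hz pole and 50 Hz bandwidth built into the toy denominator; the small residual reflects the first-order nature of the correction.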




2.2.4 Estimate the Vocal Tract Cross-sectional Area from the Formant Frequencies

One of the functions of the articulatory model is to compute the articulatory information (in particular, the vocal tract cross-sectional area) from the acoustic information (the first four formant frequencies in our study) obtained from the speech signal. In general, an optimization scheme is used to solve this speech inverse
problem.  The  optimization  scheme  varies  the  articulatory  parameters  iteratively  to 
achieve  a match  between  the  model-generated  and  the  desired  first  four  formants. 
Chapter  3 describes  an  optimization  scheme  called  simulated  annealing  in  detail. 

2.3  Acoustic  Models 

Basically,  the  acoustic  model  of  the  human  vocal  system  embodies  several 
submodels,  as  shown  in  Figure  2-6.  Both  the  vocal  tract  and  nasal  tract  models  simulate 
the  sound  propagation  in  these  tracts.  The  excitation  source  model  represents  and 
generates  the  voiced  excitation  waveforms  for  the  vocal  tract.  The  turbulent  air  flow  at  a 
constriction  for  fricatives  and  plosives  is  generated  by  the  noise  source  model.  The 
radiation  model  simulates  the  acoustic  energy  radiating  from  the  lips  and  the  nostrils. 

2.3.1  Vocal  Tract  Models 

The vocal tract is a bent, three-dimensional acoustic tube with a slowly time-varying shape; it has soft wall vibration, viscous friction and heat conduction losses, and varying boundaries at both the lips and glottis. There is a nasal side branch beginning at the top of the pharynx, of fixed dimensions but variable coupling. We describe the nasal tract in section 2.3.2.

Preliminary research has already demonstrated that the Navier-Stokes description of fluid flow can realistically characterize the nonlinearities involved in voiced-sound generation by the vocal cords, voiceless-fricative generation from turbulent flow at constrictions, and resonance and radiation effects conditioned by sound propagation in a nonuniform, lossy, soft-wall human vocal tract (Thomas, 1986; Hegerl and Hoge, 1991; Iijima et al., 1992). However, the results have been limited by the extreme computational requirements for solving the time-dependent, turbulent Navier-Stokes equations on a dense time-space grid for realistic geometric configurations. This limitation suggests the need for a simplified version of the acoustic model of the vocal tract.

Figure 2-6: A basic acoustic model of the articulatory synthesizer.


For a bent vocal tract with variable cross-sectional area, the computation of its resonances (or its acoustic transfer function) is difficult. Fortunately, Sondhi (1986) has shown that the shift in the resonance frequencies below 4 kHz is in the range of 2%-8% for typical dimensions of the vocal tract when it is straightened out. Thus, the vocal tract can be represented as a straight tube of varying cross-sectional area, as shown in Figure 2-7(a), but of fixed shape (circular or elliptic), without a significant loss in accuracy. The next assumption is plane wave propagation along the axis of the tube. Two reasons make this assumption reasonable. First, the soft tissue along the vocal tract prevents radial propagation of the sound wave. Second, the average lateral (cross-sectional) dimension of the vocal tract is about 2.0 cm, which is much smaller than the wavelength of a sound wave at 4 kHz, $\lambda = c/f = 34{,}300/4{,}000 \approx 8.6$ cm. Strictly speaking, this assumption is valid only for frequencies below 4 kHz. But for speech, where 5 kHz is considered to be an appropriate bandwidth, the planar propagation assumption is still quite adequate. By neglecting the losses due to friction, heat conduction, and yielding wall vibration, a pair of equations characterizing the wave propagation in the vocal tract can be derived. In general, the solutions to such a pair of equations can only be obtained numerically. Thus, a further approximation is needed. A more tractable approach is to represent the vocal tract as a number of contiguous cylindrical sections, as depicted in Figure 2-7(b). If the number of concatenated sections is large, these short-length elemental sections provide a stepwise approximation of the continuous area function. We can expect that at resonant frequencies the concatenated tubes are indistinguishable from the continuous ones. The uniform elementary cylindrical section is easy to treat. Once the lossless uniform tube has been analyzed, the effects of losses in the vocal tract can be accounted for.

2.3.1.1 Sound propagation

The linear wave motion in the tract is governed by the Law of Continuity and Newton's Force Law (Morse and Ingard, 1968)

$$\frac{1}{\rho c^2} \frac{\partial p}{\partial t} + \nabla \cdot \mathbf{v} = 0 \tag{2.13a}$$

$$\rho \frac{\partial \mathbf{v}}{\partial t} + \nabla p = 0 \tag{2.13b}$$

where $\nabla$ represents the gradient, $\nabla \cdot$ is the divergence,

$p(x, t)$ is the variation in sound pressure in the tube at position $x$ and time $t$,

$\mathbf{v}(x, y, z, t)$ is the particle velocity vector inside the vocal tract,

$c$ is the velocity of sound,

$\rho$ is the density of air in the tube.

If planar propagation is assumed, then all particles at a given displacement $x$ and a specific time $t$ will have the same velocity independent of location $(y, z)$ within the cross-sectional area $A(x, t)$. Since the velocity vector points in a single direction, we can drop the vector notation in equations (2.13a) and (2.13b). Define the volume velocity flow at position $x$ and time $t$ as

$$u(x, t) = A(x, t)\, v(x, t) \tag{2.14}$$

Applying the planar propagation assumption and substituting the volume velocity definition (equation (2.14)) into the paired equations (2.13a, b), we have the following new paired equations (Rabiner and Schafer, 1978; Deller et al., 1993):

$$-\frac{\partial u(x, t)}{\partial x} = \frac{1}{\rho c^2} \frac{\partial [p(x, t) A(x, t)]}{\partial t} + \frac{\partial A(x, t)}{\partial t} \tag{2.15a}$$

$$-\frac{\partial p(x, t)}{\partial x} = \rho \frac{\partial [u(x, t) / A(x, t)]}{\partial t} \tag{2.15b}$$

There is no closed form solution except for the simplest configurations. If, however, the cross-sectional area $A(x, t)$ and associated boundary conditions are specified, numerical solutions can be obtained. One method of simplifying the paired equations (2.15a, b) is to construct the vocal tract by a concatenation of uniform lossless sections, as depicted in Figure 2-7(b).

2.3.1.2 Uniform lossless section

Assume that the vocal tract is composed of SN uniform elemental sections and each section has cross-sectional area $A_k$ and length $l_k$, where $1 \le k \le SN$. This scheme corresponds to spatial sampling, with $l_k$ being the sampling interval for the $k$-th section.

Consider the $i$-th section, of length $l_i$ with constant cross-sectional area $A_i$. Define $x_{i-1} = \sum_{k=1}^{i-1} l_k$, which represents the vocal tract length from the glottis to the section $i$. Then the coupled differential equations (2.15a, b) for this elemental section become

$$-\frac{\partial u_i(x, t)}{\partial x} = \frac{A_i}{\rho c^2} \frac{\partial p_i(x, t)}{\partial t} \tag{2.16a}$$

$$-\frac{\partial p_i(x, t)}{\partial x} = \frac{\rho}{A_i} \frac{\partial u_i(x, t)}{\partial t} \tag{2.16b}$$

where $u_i(x, t)$ and $p_i(x, t)$ are the volume velocity and pressure, respectively, along the $x$ axis with $x_{i-1} < x < x_i$. The solutions to equations (2.16a, b) have the form (Flanagan, 1972; Rabiner and Schafer, 1978; Deller et al., 1993):

$$u_i(x, t) = u_i^+(t - x/c) - u_i^-(t + x/c) \tag{2.17a}$$

$$p_i(x, t) = \frac{\rho c}{A_i} \left[ u_i^+(t - x/c) + u_i^-(t + x/c) \right] \tag{2.17b}$$

where $u_i^+(t - x/c)$ and $u_i^-(t + x/c)$ indicate forward (transmitted) and backward (reflected) traveling waves, respectively. The boundary conditions at both ends of each section determine the relationship between the traveling waves in adjacent sections. They are derived from the physical principle that pressure and volume velocity must be continuous in both time and space everywhere in the tract. Refer to the textbook of Rabiner and Schafer (1978) for details.
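As a check on the traveling-wave form, the solutions (2.17a, b) can be substituted back into the section equations (2.16a, b). The short derivation below is a sketch added for illustration; the prime, denoting the derivative of a wave with respect to its own argument, is notation assumed here and not taken from the original text.

```latex
% Derivatives of the traveling-wave solutions (2.17a, b)
\frac{\partial u_i}{\partial x} = -\frac{1}{c}\left[ u_i^{+\prime}(t - x/c) + u_i^{-\prime}(t + x/c) \right],
\qquad
\frac{\partial p_i}{\partial t} = \frac{\rho c}{A_i}\left[ u_i^{+\prime}(t - x/c) + u_i^{-\prime}(t + x/c) \right]

% Hence equation (2.16a) holds:
-\frac{\partial u_i}{\partial x}
  = \frac{1}{c}\left[ u_i^{+\prime} + u_i^{-\prime} \right]
  = \frac{A_i}{\rho c^2}\,\frac{\partial p_i}{\partial t}

% Likewise for equation (2.16b):
\frac{\partial p_i}{\partial x} = \frac{\rho}{A_i}\left[ -u_i^{+\prime} + u_i^{-\prime} \right],
\qquad
\frac{\partial u_i}{\partial t} = u_i^{+\prime} - u_i^{-\prime}
\;\;\Longrightarrow\;\;
-\frac{\partial p_i}{\partial x} = \frac{\rho}{A_i}\,\frac{\partial u_i}{\partial t}
```

Both equations are satisfied for arbitrary differentiable wave shapes, which is why the boundary conditions at the junctions, and not the section equations themselves, determine the actual waves.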

2.3.1.3 Approaches for vocal tract simulation

Based  on  the  above  analysis,  there  are  two  approaches  used  for  vocal  tract 
simulation. 

Wave  propagation  approach:  This  approach  is  based  on  the  analytical  solutions  of 
equations  (2.17a,  b)  for  a lossless  elemental  uniform  section.  The  pressure  at  any  point 
within  the  section  is  considered  to  be  made  up  of  two  components,  a forward  wave  and  a 
backward  wave.  At  the  junction  of  two  cylindrical  sections  with  different  cross-sectional 
areas  and  lengths  (see  Figure  2-8),  each  wave  has  a forward  propagation  and  backward 
reflection. By defining the reflection coefficient r, considering the propagation delay τ,
applying  the  continuity  conditions  at  each  junction,  and  accounting  for  the  losses  at  the 
glottis  and  lips  as  boundary  conditions,  the  signal  flow  graph  and  the  equivalent 
discrete-time  system  can  be  obtained.  An  example  is  depicted  in  Figure  2-9.  See  Rabiner 
and  Schafer  (1978)  for  detailed  derivations. 
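The junction relations can be made concrete with a small Python sketch. It assumes the volume-velocity convention of Rabiner and Schafer (1978), in which the reflection coefficient at the junction between sections k and k+1 is (A_{k+1} - A_k)/(A_{k+1} + A_k); the function names are hypothetical, and propagation delays and the glottis and lip terminations are left out.

```python
def reflection_coefficients(areas):
    """Return r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for each junction."""
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


def scatter(r, u_fwd_in, u_bwd_in):
    """Scatter volume-velocity waves at one junction.

    u_fwd_in : forward wave arriving at the junction from section k
    u_bwd_in : backward wave arriving at the junction from section k+1
    Returns (forward wave entering section k+1, backward wave re-entering k).
    """
    u_fwd_out = (1.0 + r) * u_fwd_in + r * u_bwd_in
    u_bwd_out = -r * u_fwd_in + (1.0 - r) * u_bwd_in
    return u_fwd_out, u_bwd_out
```

For any incoming pair of waves, the outputs of `scatter` keep both the volume velocity (forward minus backward wave) and the pressure (sum of the waves divided by the section area, up to the common factor ρc) continuous across the junction, which is the physical principle the derivation rests on.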

This  approach  was  first  used  by  Kelly  and  Lochbaum  (1962)  for  speech  synthesis 
and  has  been  called  the  Kelly-Lochbaum  model  or  lattice  structure.  A more  elegant 
realization  is  given  by  means  of  wave  digital  filters  (WDF)  (Fettweis  and  Meerkotter, 
1975; Strube, 1982; Meyer and Strube, 1984; Liljencrants, 1985; Meyer et al., 1989). The WDF has been implemented in special hardware for real-time synthesis (Meyer et al., 1989). The neglect of losses and the fixed vocal tract length are the major drawbacks of this approach. By varying the sampling rate, the dynamic variation of vocal tract length can be simulated (Wright and Owens, 1993). Some progress has been made in incorporating losses (Liljencrants, 1985; Meyer et al., 1989), but more investigation is required.

Figure 2-8: Reflection relationships at the junction between two lossless sections.

Figure 2-9: Discrete-time lossless vocal tract.
(a) Signal flow graph for lossless tube model of the vocal tract;
(b) The equivalent discrete-time system.

Transmission-line approach: The transmission-line analog of the vocal tract (or equivalent electrical circuit model) is based on the similarity between the acoustic wave propagation in a cylindrical tube and the propagation of an electrical wave along a transmission line. The derivation from the basic equations of acoustic wave propagation to an equivalent electrical quadripole representation is well known (Fant, 1960; Flanagan, 1972; Linggard, 1985). The analogs are summarized in Table 2-1. Figure 2-10 is an equivalent circuit representation of a soft-wall, lossy cylindrical tube. The series resistor R is used to represent the acoustic loss due to viscous drag, in which the energy loss is proportional to the square of the volume velocity. The shunt conductance G represents the loss due to heat conduction, which is proportional to pressure squared. The shunt impedance Zw is the acoustic equivalent mechanical impedance of the yielding wall. This wall impedance, which represents a mass-compliance-viscosity loss of the soft tissue, has three components, Rw, Lw, and Cw. We will describe the wall impedance in the next subsection. Table 2-2 lists the physical definitions of all the circuit components in Figure 2-10. Note that both R and G are functions of frequency.
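To show how a quadripole (chain-matrix) representation leads to resonance frequencies, the following Python sketch cascades lossless sections and looks for zeros of the D element of the overall chain matrix. It is a simplified illustration under explicit assumptions: all losses and the wall impedance are ignored, the lips are treated as an ideal open end (zero pressure, no radiation load), and the function names are hypothetical. It is not the lossy model derived in this study.

```python
import math

RHO = 1.14e-3  # air density in g/cm^3 (value quoted in Table 2-2)
C = 3.53e4     # sound velocity in cm/s (value quoted in Table 2-2)


def chain_d(areas, lengths, f):
    """D element of the glottis-to-lips chain matrix of lossless sections."""
    k = 2.0 * math.pi * f / C
    m11, m12, m21, m22 = 1 + 0j, 0j, 0j, 1 + 0j
    for area, length in zip(areas, lengths):
        z0 = RHO * C / area                       # characteristic impedance
        cs, sn = math.cos(k * length), math.sin(k * length)
        s11, s12, s21, s22 = cs, 1j * z0 * sn, 1j * sn / z0, cs
        m11, m12, m21, m22 = (m11 * s11 + m12 * s21, m11 * s12 + m12 * s22,
                              m21 * s11 + m22 * s21, m21 * s12 + m22 * s22)
    # With zero pressure at the lips, U_lips / U_glottis = 1 / D.
    return m22.real


def resonances(areas, lengths, f_max=5000.0, step=50.0):
    """Scan for sign changes of D and refine each root by bisection."""
    found = []
    for i in range(1, int(f_max / step)):
        f_lo, f_hi = i * step, (i + 1) * step
        if chain_d(areas, lengths, f_lo) * chain_d(areas, lengths, f_hi) >= 0.0:
            continue
        for _ in range(50):
            f_mid = 0.5 * (f_lo + f_hi)
            if chain_d(areas, lengths, f_lo) * chain_d(areas, lengths, f_mid) <= 0.0:
                f_hi = f_mid
            else:
                f_lo = f_mid
        found.append(0.5 * (f_lo + f_hi))
    return found
```

For a uniform tube, for example 35 sections of 0.5 cm with equal area, the routine reproduces the odd quarter-wavelength resonances c/(4L), 3c/(4L), and so on, which gives a quick sanity check before non-uniform area functions are tried.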

2.3.1.4 Wall impedance

The pressure variation inside the vocal tract causes the cross-sectional area to change, since it exerts a varying force on the tract's elastic walls. Assume that the walls are locally reacting and the resulting cross-sectional area variation is small, i.e.,

$$A(x, t) = A_0(x, t) + \Delta A(x, t) = A_0(x, t) + y(x, t)\, S_0(x, t) \tag{2.18}$$

where $A_0(x, t)$ is the nominal area, $\Delta A(x, t)$ is a small variation, $y(x, t)$ is the yielding amount of the walls, and $S_0(x, t)$ is the circumference of the tract (Rabiner and Schafer, 1978; Maeda, 1982a). The wall vibration is modeled as a mass-compliance-viscosity mechanical model. The pressure variation is governed by the differential equation (2.19).
Table 2-1: Acoustical / electrical analogues

| Acoustical parameter | Electrical parameter |
|---|---|
| p - Pressure | v - Voltage |
| u - Volume velocity | i - Current |
| ρ/A - Air mass inertia (acoustic inductance) | L - Inductance |
| A/(ρc²) - Air compressibility (acoustic capacitance) | C - Capacitance |
| Viscous loss | R - Series resistance |
| Heat conduction loss | G - Shunt conductance |
| Yielding wall | Zw - Shunt impedance |

Figure 2-10: An equivalent circuit representation of a lossy cylindrical tube.
Table 2-2: Physical definitions of the components in Figure 2-10 (based on Wakita and Fant, 1978).

| Component | Definition | Description |
|---|---|---|
| $R$ | $\dfrac{S\,l}{A^2} \sqrt{\dfrac{\omega \rho \mu}{2}}$ | Series resistance |
| $L$ | $\dfrac{\rho\, l}{A}$ | Series inductance |
| $C$ | $\dfrac{A\, l}{\rho c^2}$ | Shunt capacitance |
| $G$ | $\dfrac{(\eta - 1)\, S\, l}{\rho c^2} \sqrt{\dfrac{\lambda \omega}{2 \xi \rho}}$ | Shunt conductance |
| $R_w$ | $\dfrac{b}{S^2 l}$ | Resistance in wall impedance |
| $L_w$ | $\dfrac{m}{S^2 l}$ | Inductance in wall impedance |
| $C_w$ | $\dfrac{S^2 l}{k}$ | Capacitance in wall impedance |

where

$S = 2 S_A \sqrt{A \pi}$: circumference of element;
$S_A$: section shape factor; for a circular cross-section, $S_A = 1$; for an elliptic cross-section, $S_A = 2$;
$l$: length of elemental tube;
$A$: cross-sectional area of element;
$\rho$: density of air, 1.14 × 10⁻³ gm/cm³ (moist air at body temperature, 37 °C);
$c$: sound velocity, 3.53 × 10⁴ cm/sec (moist air at body temperature, 37 °C);
$\mu$: viscosity, 1.86 × 10⁻⁴ dyne-sec/cm² (20 °C, 0.76 m Hg);
$\lambda$: coefficient of heat conduction of air, 0.055 × 10⁻³ cal/cm-sec-deg (0 °C);
$\eta$: adiabatic gas constant, 1.4;
$\xi$: specific heat, 0.24 cal/gm-degree (0 °C, 1 atmos.);
$\omega$: radian frequency.
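With the wall modeled as a mass-compliance-viscosity system, the wall displacement satisfies

$$S_0(x, t)\, p(x, t) = m \frac{\partial^2 y(x, t)}{\partial t^2} + b \frac{\partial y(x, t)}{\partial t} + k\, y(x, t) \tag{2.19}$$

where $m$, $b$, and $k$ are the mass, viscosity, and compliance, respectively, of the wall per unit length of the tract (Maeda, 1982a). Define the volume velocity generated by the wall vibration as

$$u_w(x, t) = \frac{\partial [y(x, t)\, S_0(x, t)\, l]}{\partial t} \tag{2.20}$$

where $l$ is the length of the tract. By substituting equation (2.20) into equation (2.19), the wall vibration can be rewritten as

$$\frac{\partial p(x, t)}{\partial t} = \frac{m}{S_0^2(x, t)\, l} \frac{\partial^2 u_w(x, t)}{\partial t^2} + \frac{b}{S_0^2(x, t)\, l} \frac{\partial u_w(x, t)}{\partial t} + \frac{k}{S_0^2(x, t)\, l}\, u_w(x, t) \tag{2.21}$$

For an elemental uniform section, equation (2.21) is simplified to

$$\frac{\partial p(x, t)}{\partial t} = \frac{m}{S_0^2 l} \frac{\partial^2 u_w(x, t)}{\partial t^2} + \frac{b}{S_0^2 l} \frac{\partial u_w(x, t)}{\partial t} + \frac{k}{S_0^2 l}\, u_w(x, t) = L_w \frac{\partial^2 u_w(x, t)}{\partial t^2} + R_w \frac{\partial u_w(x, t)}{\partial t} + \frac{1}{C_w}\, u_w(x, t) \tag{2.22}$$

where $L_w = \dfrac{m}{S_0^2 l}$, $R_w = \dfrac{b}{S_0^2 l}$, and $C_w = \dfrac{S_0^2 l}{k}$ are the components of the wall vibration impedance (see Figure 2-10 and Table 2-2).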
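A brief Python sketch shows how the three wall elements and the resulting series impedance could be evaluated for one section. The function names are hypothetical, and the numerical values in the usage note are placeholders chosen for easy arithmetic, not the measured wall data of Table 2-3.

```python
import math


def wall_elements(m, b, k, s0, l):
    """Return (L_w, R_w, C_w) for a uniform section, following equation (2.22).

    m, b, k : wall mass, viscosity, and compliance parameters per unit length
    s0      : circumference of the section
    l       : length of the section
    """
    scale = s0 ** 2 * l
    return m / scale, b / scale, scale / k


def wall_impedance(m, b, k, s0, l, f):
    """Series R-L-C wall impedance Z_w = R_w + j*w*L_w + 1/(j*w*C_w) at f Hz."""
    l_w, r_w, c_w = wall_elements(m, b, k, s0, l)
    omega = 2.0 * math.pi * f
    return r_w + 1j * omega * l_w + 1.0 / (1j * omega * c_w)
```

For instance, with the placeholder values m = 1, b = 100, k = 1000, s0 = 2, and l = 0.5, `wall_elements` gives L_w = 0.5, R_w = 50, and C_w = 0.002. The impedance form follows from taking equation (2.22) to the frequency domain and dividing by jω.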


The  wall  impedance  can  be  either  included  in  every  elemental  section  of  the  vocal 
tract  as  a distributed  element  (Flanagan,  1972;  Flanagan  and  Ishizaka,  1976;  Flanagan  et 
al.,  1975,  1980;  Ishizaka  et  al.,  1975;  Maeda,  1982a)  or  inserted  as  a lumped  shunt 
element,  one  in  the  pharynx  and  one  at  the  level  of  the  cheek  (Wakita  and  Fant,  1978; 
Badin and Fant, 1984; Lin, 1990). As Wakita and Fant (1978) indicated, the lumped wall impedance, which is independent of the vocal tract configuration, may not give satisfactory results. The distributed wall impedance is used in the present study.


Table 2-3 presents data concerning the wall mass, viscosity, and compliance found in the literature. In some cases, the compliance was not used since it has virtually no effect on the resonances of the model (Wakita and Fant, 1978). The data measured by
Ishizaka  et  al.  (1975)  are  used  in  our  study.  As  Maeda  (1982a)  pointed  out,  the  total  mass 
of  the  walls  may  vary  unrealistically  if  the  yielding  wall  parameters  are  specified  in  terms 
of  a unit  surface  area.  Thus,  the  per  unit  length  specification  was  used  in  his  vocal  tract 
simulation.  We  follow  Maeda’s  specification. 

2.3.2 Nasal Tract and Sinus Cavities

The  nasal  tract  constitutes  a side  branch  of  the  vocal  tract.  The  velopharyngeal 
port  controls  the  coupling  between  these  two  tracts  for  producing  certain  sounds.  A 
general rule is that when the opening area is smaller than 20 mm² there is no apparent nasality. A wider opening produces nasal resonance, and speech is definitely perceived as nasal when the area approaches 50 mm² (Borden and Harris, 1980). In our articulatory
model,  the  opening  area  of  the  velopharyngeal  port  is  simulated  by  lowering  the  velum 
along  a line  segment,  as  mentioned  in  section  2.2.1. 

The  nasal  tract  has  two  channels  at  the  nostrils.  Usually,  an  acoustically 
approximated  single  tract  is  used  owing  to  its  quasi-symmetrical  profile.  The  minor 
errors  due  to  this  approximation  have  been  analyzed  by  Lin  (1990).  Figure  2-11  shows 
the area function of the nasal tract used by Maeda (1982b), where the nasal tract is assumed to be 11 cm long and consists of 11 elemental uniform sections. Generally the nasal tract has a fixed structure except for the first few sections, indicated by a dashed line in Figure 2-11, where the area varies with the velopharyngeal port opening. Maeda (1982b) used linear interpolation to obtain the areas of the second and third sections between the coupling section (the first section) and the first fixed section (the fourth section).

From  sweep  frequency  measurements  of  the  acoustic  transfer  function 
(Lindqvist-Gauffin  and  Sundberg,  1976)  and  simulation  studies  (Maeda,  1982b;  Fant, 
1985; Lin, 1990) of the nasal tract, it has been found that the nasal sinuses have to be considered as a part of the acoustic system. In a model of speech production,
Lindqvist-Gauffin  and  Sundberg  (1976)  indicated  that  at  least  two  shunting  cavities,  the 
sinus  maxillares  and  the  sinus  frontales,  must  be  added  to  improve  the  nasal  quality. 
Since  the  opening  area,  which  couples  the  sinus  cavities  to  the  nasal  tract,  is  rather  small, 
the  sinus  cavities  can  be  regarded  as  Helmholtz  resonators.  According  to  the 
Lindqvist-Gauffin  and  Sundberg  (1976)  study,  a reasonable  estimate  of  the  resonant 
frequencies  would  be  200-800  Hz  for  the  maxillary  sinuses  and  500-2000  Hz  for  the 
frontal  sinuses.  The  effect  of  the  sinus  resonance  on  the  acoustic  system  is  modeled  as  a 
shunt  circuit  element  (Fant,  1985),  as  shown  in  Figure  2-12,  and  the  resonance  can  be 
tuned  to  the  required  frequency.  Fant  (1985)  inserted  the  sinus  maxillares  and  frontales  at 
positions  6 cm  and  8 cm  from  the  nostrils,  respectively.  The  two  sinuses  are  tuned  to 
resonate  at  500  Hz  and  at  1400  Hz  respectively.  However,  Maeda  (1982b)  inserted  only 
the  sinus  maxillares  at  a position  4 cm  from  the  nostrils  and  showed  that  the  quality  of  all 
nasalized  vowels  was  satisfactory.  Table  2-4  lists  data  for  the  shunt  circuit  components 
used  in  the  literature  (Sondhi  and  Schroeter,  1987;  Lin,  1990). 
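The tuning of a sinus treated as a Helmholtz resonator can be sketched as follows. This is an illustration only: the function name and the geometry in the usage note are hypothetical, the neck end correction and all losses are ignored, and the acoustic mass and compliance use the lumped-element analogs of Table 2-1 applied to the neck and the cavity volume.

```python
import math


def helmholtz_frequency(neck_area, neck_length, volume, rho=1.14e-3, c=3.53e4):
    """Resonance frequency (Hz) of a lossless Helmholtz resonator.

    neck_area   : area of the opening that couples the sinus to the tract (cm^2)
    neck_length : effective length of the opening (cm)
    volume      : volume of the sinus cavity (cm^3)
    """
    mass = rho * neck_length / neck_area       # acoustic inductance, rho*l/A
    compliance = volume / (rho * c ** 2)       # acoustic capacitance, V/(rho*c^2)
    return 1.0 / (2.0 * math.pi * math.sqrt(mass * compliance))
```

With an invented geometry of a 0.1 cm² opening, 0.5 cm long, feeding a 20 cm³ cavity, the function returns a resonance of roughly 560 Hz, which falls inside the 200-800 Hz range quoted above for the maxillary sinuses. Changing the opening area or the cavity volume shifts the resonance, which is the sense in which the shunt element can be tuned.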

To  investigate  the  effects  of  the  nasal  tract  and  sinus  cavities,  our  software  system 
provides  the  user  with  a method  to  vary  the  nasal  tract  structure  (area  and  length),  to 
assign  the  number  of  coupled  sinus  cavities,  and  to  change  the  circuit  component  values 
and  the  coupling  position  of  each  sinus. 

2.3.3 Radiation Models of Lips and Nostrils

Acoustic  energy  escapes  from  the  vocal  tract  via  the  lips.  From  the 
transmission-line  analogs,  the  lips  are  treated  as  a radiation  impedance  that  loads  the  vocal 
tract.  The  radiation  impedance  contains  a resistive  part  that  represents  acoustic  energy 
loss  and  a reactance  part  that  represents  the  mass  inertia  of  air  at  the  lips  (Fant,  1960). 
Radiation  from  a spherical  baffle  is  one  model  for  the  radiation  impedance  that  is 
represented by nonlinear functions (Morse, 1948; Morse and Ingard, 1968). Stevens et al. (1953) made approximations and represented the radiation impedance by a resistive load
and  three  other  frequency-dependent  components.  Fant  made  another  approximation  and 
modeled  the  impedance  by  two  frequency-dependent  components,  one  being  resistive  and 
the  other  inductive  (Fant,  1960;  Wakita  and  Fant,  1978). 

Another  simplified  radiation  model  is  to  assume  that  the  radiating  surface  is  set  in 
a plane  baffle  of  infinite  extent.  In  this  case,  the  radiation  impedance  is  formed  by  a first 
order  Bessel  function  and  Struve  function  (Rayleigh,  1945;  Flanagan,  1972;  Wakita  and 
Fant,  1978).  Flanagan  (1972)  provided  a good  approximation  to  this  complicated 
representation  by  a parallel  connection  of  a resistance  and  an  inductance.  The  most 
important  feature  of  Flanagan’s  model  (1972)  is  that  both  circuit  components  are 
frequency  independent.  Figure  2-13  illustrates  the  Stevens  et  al.  (1953)  model  and 
Flanagan  (1972)  model. 

Comparisons  between  models  have  been  made  by  researchers  (Wakita  and  Fant, 
1978;  Badin  and  Fant,  1984;  Lin,  1990).  The  Stevens  et  al.  (1953)  model  yields  the  most 
accurate result. However, the Flanagan (1972) model is usually preferred for time-domain synthesis (Flanagan and Ishizaka, 1976; Flanagan et al., 1975, 1980; Maeda, 1982a) and is used in our synthesis model. The same radiation model is used for the nostrils.
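A parallel resistance-inductance radiation load of the kind described above can be sketched in Python. The element values used here, R = 128ρc/(9π²A) and L = 8ρa/(3πA) with a the radius of a circular opening of area A, are the commonly quoted frequency-independent values for a piston in an infinite baffle (see, e.g., Rabiner and Schafer, 1978); they are stated as an assumption because the text above does not list them, and the function names are hypothetical.

```python
import math

RHO = 1.14e-3  # air density in g/cm^3
C = 3.53e4     # sound velocity in cm/s


def radiation_elements(area):
    """Return (R, L) of the parallel radiation load for an opening of given area."""
    radius = math.sqrt(area / math.pi)
    resistance = 128.0 * RHO * C / (9.0 * math.pi ** 2 * area)
    inductance = 8.0 * RHO * radius / (3.0 * math.pi * area)
    return resistance, inductance


def radiation_impedance(area, f):
    """Impedance of R in parallel with j*w*L at frequency f (Hz)."""
    resistance, inductance = radiation_elements(area)
    jwl = 1j * 2.0 * math.pi * f * inductance
    return resistance * jwl / (resistance + jwl)
```

At low frequencies the load behaves like the inductance (the mass of air at the opening) with a small resistive part growing as the square of frequency, and at very high frequencies it approaches the resistance alone. Because both elements are constants, the load maps directly onto a time-domain circuit simulation.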


The relationship between volume velocity at the lips or nostrils and the radiated pressure at a distance $d$ cm from the lips or nostrils is given by (Fant, 1960)

$$\frac{P_r(\omega)}{U_r(\omega)} = \frac{\rho \omega}{4 \pi d}\, K_T(\omega) \tag{2.23}$$

The factor $K_T(\omega)$ is a smooth high-frequency emphasis. Due to a lack of experimental verification, $K_T(\omega)$ is generally set to unity and the relationship is essentially a differentiation (Badin and Fant, 1984).
Figure 2-13: Radiation models.
(a) Stevens et al. (1953) approximation for an orifice in a sphere.
(b) Parallel connection of Flanagan model (1972).

Figure 2-14: The LF model for the differential glottal waveform.

2.3.4 Excitation Source Models

Basically,  there  are  two  kinds  of  speech  sounds.  One  is  voiced,  which  involves 
quasi-periodic  vibrations  of  the  vocal  folds.  The  other  is  unvoiced,  which  involves  the 
generation  of  turbulence  noise  by  the  rapid  flow  of  air  past  a narrow  constriction.  In  the 
case  of  voiceless  speech,  the  excitation  waveform  appears  somewhat  like  a random  noise 
source,  which  we  will  discuss  in  section  2.3.6.  For  voiced  speech,  the  excitation  source  is 
a quasi-periodic  pulse  train  located  at  the  glottis. 

2.3.4.1 Excitation at the glottis

In  the  case  of  voiced  speech,  the  conventional  LPC  methods  use  only  an  impulse 
train  as  the  excitation,  which  does  not  generate  natural  sounds  (Childers  and  Wu,  1990). 
It  is  well-known  that  the  “naturalness”  of  synthetic  speech  is  closely  related  to  the  shape 
of  the  glottal  pulse  (Rosenberg,  1971;  Holmes,  1973;  Klatt  and  Klatt,  1990;  Childers  and 
Wu,  1990;  Childers  and  Lee,  1991).  We  do  not  yet  have  a complete  understanding  of  the 
phonatory  behavior  of  the  vocal  folds.  Thus,  we  lack  an  efficient  model  for  the  voice 
source.  However,  several  models  capable  of  describing  the  major  characteristics  of  the 
glottal flow have been proposed. They can be classified into two major categories:
interactive  and  non-interactive  models  (Fujisaki  and  Ljungqvist,  1986). 

In  the  interactive  models,  there  are  two  approaches  to  generate  the  glottal  volume 
velocity.  For  the  method  known  as  the  non-physical  approach,  the  glottal  flow  is 
calculated  by  modeling  the  glottal  area  (Ananthapadmanabha  and  Fant,  1982;  Titze,  1984; 
Allen  and  Strong,  1985;  Pinto  et  al.,  1989)  or  conductance  (Rothenberg,  1981)  function 
and  by  incorporating  the  various  impedances  of  the  acoustic  system  into  the  model.  For 
the  method  known  as  the  physical  approach,  structural  modeling  of  the  mechanical 
vibration of the vocal cords (Flanagan and Landgraf, 1968; Ishizaka and Flanagan, 1972;
Titze, 1973) or a kinematic model for the 3-D glottis (Titze, 1989) has been attempted.
The need to know the details of the physical characteristics of the various parts of the
vocal cords is the major drawback of the interactive models. Furthermore, the
computational burden for such models is high.

In  contrast,  the  non-interactive  models  directly  parameterize  the  glottal  flow  or 
flow  derivative  function.  If  the  parameters  are  sufficient  to  represent  the  glottal 
waveform,  it  may  be  possible  to  reconstruct  the  waveform  from  a given  set  of  parameters. 
Therefore,  the  parameterization  provides  a method  for  generating,  classifying,  and  storing 
a large  number  of  glottal  waveforms  for  various  voicing  conditions.  A number  of 
non-interactive  models  exist  in  the  literature  (Rosenberg,  1971;  Fant,  1979; 
Ananthapadmanabha,  1982;  Fant  et  al.,  1985;  Fujisaki  and  Ljungqvist,  1986;  Klatt  and 
Klatt,  1990).  The  Liljencrants-Fant  (LF)  model  (Fant  et  al.,  1985)  is  often  used  because: 
(1)  it  is  preferred  by  listeners  when  they  evaluate  synthesized  speech  (Childers,  1991; 
Eggen,  1992);  and  (2)  it  has  been  shown  to  be  superior  to  other  models  of  the  same 
complexity  (Fant  et  al.,  1985;  Fujisaki  and  Ljungqvist,  1986).  The  LF  model  requires 
four  parameters  for  modeling  the  differential  glottal  waveform  (see  Figure  2-14).  For  a 
more  detailed  description  of  its  properties  and  implementation  refer  to  Fant  (1986,  1988, 
1993),  Fant  and  Lin  (1988,  1989),  and  Lin  (1990).  Another  advantage  of  the  LF  model  is 
that  parameters  of  the  model  can  be  measured  or  estimated  from  the  inverse  filtered 
speech  and  the  EGG  signal  or  from  the  inverse  filtered  speech  signal  only  (Lee,  1988; 
Childers  and  Lee,  1991). 

It  is  well  known  that  “jitter,”  the  aperiodicity  of  the  fundamental  frequency  F0  of 
voicing  (Horii,  1979),  and  “shimmer,”  the  period-to-period  random  fluctuations  in 
glottal-pulse  amplitude  (Horii,  1980),  also  contribute  to  a natural  sounding  voice.  Klatt 
and  Klatt  (1990)  included  a slow  quasi-random  drift  called  “flutter”  into  the  voicing 
source  model  to  simulate  jitter  (pitch  perturbation)  but  did  not  include  a shimmer  model. 
Lalwani  and  Childers  (1991)  proposed  a unified  glottal  excitation  model  that  includes  the 
pitch  perturbation  model  with  rate  of  perturbation  control  and  the  aspiration  noise  model 
with  amplitude  modulation  into  the  LF  model.  This  unified  model  has  the  capability  to 
include the “shimmer,” i.e., the amplitude perturbation model. We use a simplified
version  of  this  unified  model  as  a non-interactive  glottal  excitation  model  in  our  study. 
Figure  2-15  illustrates  the  block  diagram  of  such  a simplified  excitation  model. 
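
Jitter and shimmer can be pictured as small random perturbations of the period and the amplitude of successive glottal pulses. The Python sketch below is only an illustration of that idea: the function name, the uniform perturbation, and the default depths are assumptions, and it does not reproduce the Lalwani and Childers (1991) model or the simplified version used in this study.

```python
import random


def perturbed_pulse_train(f0, n_periods, jitter=0.01, shimmer=0.05, seed=0):
    """Return a list of (period_s, amplitude) pairs, one per glottal pulse.

    jitter and shimmer are relative perturbation depths (fractions of the
    nominal period and amplitude). The uniform distribution is an assumption
    of this sketch.
    """
    rng = random.Random(seed)
    t0 = 1.0 / f0  # nominal pitch period in seconds
    pulses = []
    for _ in range(n_periods):
        period = t0 * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        amplitude = 1.0 + shimmer * rng.uniform(-1.0, 1.0)
        pulses.append((period, amplitude))
    return pulses
```

Each pair could then be used to scale and time-stretch one cycle of a glottal waveform model such as the LF model.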

2.3.4.2 Excitation in the vocal tract

The normal speaker routinely phonates using the vocal folds. Unfortunately, over
1.5 million non-speaking persons in the USA, excluding some deaf individuals (Klatt,
1987), cannot phonate using the vocal folds. Although airstream-activated
aerodynamic-mechanical devices can aid the vocally handicapped, they produce no sound
until pulmonary air is diverted through them (Hilgers and Schouwenburg, 1990). In
1980, an electrically driven intraoral artificial larynx was invented (Lowry, 1981). This
new speech prosthesis consists of a small speaker (with battery) and a resonator horn,
which are joined to a dental plate and placed in the oral cavity of the subject (Myrick and
Yantorno, 1993). With such a device the subject can produce intelligible speech, although
the quality is still inferior.

Since this electrically driven speech prosthesis must be placed in the vocal tract, it is
reasonable to expect that the driving-point acoustic transfer function is different from that
seen by the glottis. Thus, the excitation signal must have a different waveform from the
glottal pulse to produce the same speech sounds. Myrick and Yantorno (1993) presented
the vocal tract frequency response when the excitation is located at the sixth section of a
ten-section vocal tract by using the Kelly-Lochbaum model (1962) and the lossless
transmission-line analog model (Flanagan, 1972).

Our software system provides several advanced features for the user to investigate
the properties when the excitation is located in the vocal tract:

[1] the excitation can be located at any section of a sixty-section vocal tract with
soft-wall vibration, thermal, and heat conduction losses,

[2] the nasal tract with or without sinus cavities can be coupled into the vocal tract
by varying the opening area of the velopharyngeal port,

[3] the subglottal system can be coupled to the vocal tract when the glottis is
opened.

Figure 2-15: The simplified excitation model of Lalwani and Childers (1991).

2.3.5  Glottal  Impedance  and  Subglottal  Models 

According  to  the  classical  formulation  of  the  acoustic  theory  of  speech  production 
(Fant,  1960;  Flanagan,  1972),  the  voicing  source  is  characterized  as  a current  source.  This 
assumes  that  the  glottal  waveform  depends  very  little  on  the  shape  or  impedance  of  the 
vocal  tract.  Similarly,  the  vocal  tract  is  modeled  by  a time-invariant  linear  filter  since  the 
glottal  impedance  is  assumed  much  higher  than  the  vocal  tract  impedance.  However, 
recent  work  by  several  researchers  has  shown  that  there  does  exist  a certain  degree  of 
dependency  of  the  glottal  flow  on  the  load  of  the  vocal  tract  and  the  subglottal  cavities.  A 
number  of  major  interaction  consequences  have  been  identified.  They  are: 

[1] F1 (first formant) ripple in the source waveform. One may often observe a
“hump” in the rising portion of the glottal volume velocity waveform obtained
by using inverse filtering (Childers and Wu, 1990).

[2] Nonlinear F1-F0 interaction. The pharyngeal pressure standing waves may
have a nonlinear effect that causes an increase in the glottal source strength
whenever F1 is near an integral multiple of F0 (Ananthapadmanabha and Fant,
1982).

[3] Truncation of the F1 damped sinusoid. The time-varying glottal impedance
affects the vocal tract transfer function primarily by increasing the
first-formant bandwidth, which leads to a truncation of the F1 damped
sinusoid when the glottis is open (Ananthapadmanabha and Fant, 1982).

[4] Pulse-skewing. The inertive loading by the sub- and supraglottal acoustic
systems results in a skewing to the right of the glottal pulse (Rothenberg,
1981).

Although speech can be synthesized by using a simple non-interactive
excitation  model,  it  seems  to  be  essential  that  a glottal  excitation  model  used  in  the 
articulatory  speech  synthesizer  should  reproduce  the  variations  in  the  acoustic  features  of 
the  excitation  more  naturally  (Sondhi  and  Schroeter,  1987).  Interactive  physical  models 
are  hard  to  implement  since  one  needs  the  physiological  characteristics  of  the  vocal  folds, 
and,  furthermore,  most  models  are  computationally  inefficient.  On  the  other  hand, 
interactive  non-physical  models  are  attractive  for  researchers.  Klatt  and  Klatt  (1990) 
included  the  ability  to  change  the  first-formant  bandwidth  pitch-synchronously  to 
simulate  the  interaction  between  source  and  vocal  tract  in  their  formant  synthesizer.  For 
the  articulatory  synthesizer,  a prescribed  glottal  area  time  function  is  usually  used  for 
source-tract  interaction. 

The  glottal  area  is  defined  as  the  opening  between  the  vocal  folds.  It  is 
time-varying  during  voiced  phonation  and  quasi-steady  for  voiceless  phonation.  One 
possible  method  to  obtain  the  time-varying  glottal  area  is  from  ultra  high  speed  films 
(Moore  and  Childers,  1983;  Childers  and  Larar,  1984;  Childers  et  al.,  1984;  Childers  and 
Krishnamurthy, 1985; Childers et al., 1990). The area function inferred in this manner is
the  projected  area,  i.e.,  the  minimum  area  of  the  glottis.  Figure  2-16  shows  three 
measured  glottal  area  waveforms  from  ultra  high  speed  films,  where  one  can  see  that  the 
projected  glottal  area  tends  to  have  a roughly  triangular  shape  that  is  slightly  skewed  to 
the  left.  The  sharp  peak  of  the  glottal  area  function  usually  results  in  the  excitation  at  the 
apex  being  exaggerated  (Lin,  1990).  On  the  other  hand,  Cranen  and  Boves  (1987)  have 
derived  the  glottal  area  from  a vertically  uniform  glottis  by  using  the  two-mass  model 
(Ishizaka  and  Flanagan,  1972).  The  glottal  area  function  derived  in  this  manner  is  called 
the  effective  glottal  area  function  and  does  not  coincide  with  the  projected  glottal  area. 
In practice, however, simple functions such as a triangle, a sine, and a raised cosine are often used to
model the glottal area (Ananthapadmanabha and Fant, 1982). Figure 2-17 illustrates the
glottal area waveforms modeled by triangular, sine, and raised-cosine functions. Our
software program implementation provides these three functions as the options for
modeling the glottal area. In addition, for triangular and raised-cosine functions, the
opening and closing durations can be specified to model the glottal area skewing.

Figure 2-17: Glottal area models. (a) Triangle, (b) Sine, (c) Raised cosine.
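
The three glottal area shapes can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in our software: the function name, argument names, and the choice to treat the sine as a symmetric half cycle over the whole open phase are assumptions.

```python
import math


def glottal_area(t, t_open, t_close, a_max, shape="triangle"):
    """Glottal area at time t (seconds) within one pitch period.

    The glottis opens over t_open seconds, closes over t_close seconds,
    and is shut for the rest of the period. a_max is the peak area.
    Unequal t_open and t_close skew the triangle and raised-cosine shapes.
    """
    if t < 0.0 or t >= t_open + t_close:
        return 0.0
    if shape == "sine":
        # Half sine over the whole open phase (no separate skew control).
        return a_max * math.sin(math.pi * t / (t_open + t_close))
    if t < t_open:
        x = t / t_open                     # rises 0 -> 1 while opening
    else:
        x = 1.0 - (t - t_open) / t_close   # falls 1 -> 0 while closing
    if shape == "triangle":
        return a_max * x
    if shape == "raised_cosine":
        return a_max * 0.5 * (1.0 - math.cos(math.pi * x))
    raise ValueError("unknown shape: " + shape)
```

Choosing an opening duration longer than the closing duration moves the peak toward the end of the open phase, which is how the skewing mentioned above would be expressed in this sketch.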


The time-varying glottal impedance is determined by the time-varying glottal area
function. It contains a resistance and an inductance. Assume that the glottis is modeled as
a rectangular slit with $A_g$, $l_g$, and $d$ as the area, length, and thickness, respectively. Then
the glottal inductance is given by

$$L_g = \frac{\rho\, d}{A_g} \tag{2.24}$$

The resistance of the glottis, according to an experiment by van den Berg et al. (1957), is
formulated by

$$R_g = \frac{12\,\mu\, d\, l_g^2}{A_g^3} + k_g\,\frac{\rho\, u_g}{2 A_g^2} \tag{2.25}$$

where $\mu$ is the viscosity of air, $k_g$ is a coefficient, $\rho$ is the density of air in the tube, and $u_g$ is
the glottal volume velocity. Some typical values of $k_g$ used in the literature are 0.875 (van
den Berg et al., 1957), 0.9 (Stevens, 1971), 1.1 (Ananthapadmanabha and Fant, 1982),
and 1.38 (Maeda, 1982a).
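
As a rough numerical illustration, the glottal inductance and resistance can be evaluated directly from the slit geometry. The Python sketch below assumes the standard rectangular-slit forms (inductance rho*d/A_g; resistance as a viscous term plus a flow-dependent kinetic term). The function names, the air constants, and the use of the absolute value of the flow are assumptions made for the example.

```python
RHO = 1.14e-3  # assumed air density, g/cm^3
MU = 1.86e-4   # assumed air viscosity, g/(cm s)


def glottal_inductance(a_g, d, rho=RHO):
    """Inductance of a rectangular glottal slit: rho * d / A_g."""
    return rho * d / a_g


def glottal_resistance(a_g, l_g, d, u_g, k_g=0.875, rho=RHO, mu=MU):
    """Viscous term plus flow-dependent kinetic term of the glottal resistance.

    a_g : glottal area (cm^2), l_g : slit length (cm), d : thickness (cm)
    u_g : glottal volume velocity (cm^3/s); abs() keeps the kinetic term
          non-negative for reverse flow (an assumption of this sketch).
    """
    viscous = 12.0 * mu * d * l_g ** 2 / a_g ** 3
    kinetic = k_g * rho * abs(u_g) / (2.0 * a_g ** 2)
    return viscous + kinetic
```

Both quantities grow rapidly as the area shrinks, which is why the glottal impedance is high for a nearly closed glottis and low for a wide opening.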


The subglottal system, which includes the tracheal tube and lungs, is usually
omitted in vocal tract simulation, since its effect on speech spectra is assumed to be minor,
except for unvoiced sounds, where the glottal opening is large (Ishizaka et al., 1976). However,
when the glottal opening is large, which means the glottal impedance is no longer very
high, the coupling between the subglottal system and the vocal tract is not negligible.
Ishizaka et al. (1976) measured the acoustic input impedance of the subglottal system in
laryngectomized subjects. Ananthapadmanabha and Fant
(1982) used the Ishizaka et al. (1976) experimental data and represented the subglottal
system as three cascaded RLC resonance modules, called the Foster-chain circuit. Figure
2-18 shows the circuit and the corresponding component values. The subglottal formants
were located at 640, 1335, and 2110 Hz, with corresponding bandwidths of 246, 155,
and 140 Hz. The effects of the Foster-chain subglottal model on the vocal tract formants
and bandwidths have been analyzed (Ananthapadmanabha and Fant, 1982; Badin and
Fant, 1984; Lin, 1990). In summary, the acoustic effect of the subglottal system is
small, except for unvoiced sounds, where the glottal opening is fairly large.
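
To show how formant frequencies and bandwidths map onto resonator elements, the Python sketch below builds three RLC resonators from the quoted subglottal formants. It is a hedged illustration only: it assumes that each module is a parallel RLC resonator and that the modules are connected in series, and the capacitance value is an arbitrary placeholder. It does not reproduce the component values of Figure 2-18.

```python
import math

# (formant frequency, bandwidth) in Hz, as quoted in the text
SUBGLOTTAL_FORMANTS = [(640.0, 246.0), (1335.0, 155.0), (2110.0, 140.0)]


def rlc_from_formant(f_hz, b_hz, c=1.0e-3):
    """Pick R and L of a parallel RLC resonator for a given capacitance c.

    Uses F = 1 / (2*pi*sqrt(L*C)) and B = 1 / (2*pi*R*C).
    The capacitance c is a hypothetical placeholder value.
    """
    w0 = 2.0 * math.pi * f_hz
    l = 1.0 / (w0 * w0 * c)
    r = 1.0 / (2.0 * math.pi * b_hz * c)
    return r, l, c


def resonator_impedance(f_hz, r, l, c):
    """Impedance of one parallel RLC resonator at frequency f_hz (> 0)."""
    w = 2.0 * math.pi * f_hz
    return 1.0 / (1.0 / r + 1.0 / (1j * w * l) + 1j * w * c)


def subglottal_impedance(f_hz, formants=SUBGLOTTAL_FORMANTS, c=1.0e-3):
    """Series sum of the resonator impedances (assumed topology)."""
    return sum(resonator_impedance(f_hz, *rlc_from_formant(f, b, c))
               for f, b in formants)
```

At each formant frequency the corresponding resonator is purely resistive, so the magnitude of the summed impedance peaks near 640, 1335, and 2110 Hz in this sketch.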

Combining  the  simplified  Lalwani  and  Childers  excitation  source  model  (1991), 
Ananthapadmanabha  and  Fant  glottal  area  model  (1982),  and  Foster-chain  subglottal 
model,  we  propose  an  interactive  excitation  model,  shown  in  Figure  2-19. 

2.3.6  Noise  Source  Models 

When  there  is  a flow  of  air  through  a constriction  or  past  an  obstruction, 
turbulence  is  created  (Stevens,  1971,  1993a,  1993b;  Shadle,  1991).  The  random  velocity 
fluctuations in the flow can act as a source of sound called turbulence noise. Three types of
consonants  produced  in  this  manner  are  fricatives,  stops  (plosives),  and  affricates. 
Fricatives  are  generated  with  the  turbulent  flow  excitation  located  in  the  region  of  a 
constriction  in  the  vocal  tract.  Plosives  are  produced  by  making  a complete  closure  of  the 
tract,  building  up  pressure  and  abruptly  releasing  it.  The  stop  release  is  frequently 
followed  by  a turbulence  noise  excitation.  Affricates  are  dynamic  sounds  that  can  be 
modeled  as  the  concatenation  of  a stop  and  a fricative.  A special  phoneme  /h/,  called 
aspirate, is produced with turbulent flow through the glottis. Refer to Broad (1977a) and
Borden and Harris (1980) for more details on the generation of the unvoiced speech
sounds.

Under the plane wave assumption, the sound pressure of turbulent flow can be
taken as proportional to the square of the volume velocity of the airflow and inversely
proportional to the constriction area, Ac (Stevens, 1971). The turbulence
noise source may be located at the center of, or immediately downstream from, the
constriction region, or possibly at a combination of these places, or spatially distributed
along the constriction region (Fant, 1960; Flanagan and Cherry, 1968; Stevens, 1971,
1993a, 1993b; Lin, 1990). The spectrum of the turbulence noise is broadly distributed
over a wide range of frequencies (2-8 kHz) with some accentuation in the mid-audio
range (Stevens, 1971, 1993a, 1993b; Childers and Lee, 1991).

Figure 2-18: Foster-chain circuit model for the subglottal system.

Basically,  the  noise  source  model  defines  the  characteristics  of  the  noise  source  as 
a function  of  the  airflow  through  the  constriction  and  of  the  constriction  cross-sectional 
area  Ac.  Meyer-Eppler  (1953)  (Broad,  1977b)  found  that  the  rms  sound  pressure,  Prms,  of 
the  noise  could  be  expressed  as 

$$P_{rms} = A_c \left(R_e^2 - R_{ec}^2\right) \tag{2.26}$$

where $R_e$ is the Reynolds number and $R_{ec}$ is the critical Reynolds number. Fant (1960)
adopted a serial noise pressure source and reformulated the noise pressure as a function of the
pressure drop through the constriction and the effective width of the constriction. Lin
(1990)  extended  Fant’s  model  to  include  the  frictional  and  turbulent  losses  inside  the 
constriction.  Both  Fant  and  Lin  tried  to  reconstruct  the  fricative  spectra  from  area 
functions by using the acoustic transfer function. However, while some fricatives have been
modeled quite successfully, others remain unsatisfactory (Badin, 1989, 1991). Klatt (1980)
used a random number generator, a spectrum-shaping filter, and an amplitude modulator
to model the turbulent flow for the formant synthesizer. The spectrum-shaping filter was
designed to simulate the spectral characteristics of the turbulent flow. A first-order IIR
filter was used to obtain the volume velocity due to a random pressure source. Childers
and Lee (1991) have used an FIR filter to model highpass-filtered turbulence noise. Cook
(1991, 1993) used a four-pole filter to model the spectral properties of the noise source.

By  including  a latent  random  pressure  source,  Pn,  and  an  inherent  constriction 
loss,  Rn,  in  each  elemental  section  of  the  vocal  tract,  Flanagan  and  Cherry  (1968)  could 
introduce automatically the turbulent flow excitation at any section. The Pn source was
produced from Gaussian noise, which was bandpass-filtered from 500 to 4000 Hz, and the
flow, Un, was lowpass-filtered to 500 Hz before it modulated the Pn noise source. Figure
2-20 illustrates the schematic diagram. Such a turbulence noise model has been used in
several studies (Flanagan et al., 1975, 1980; Flanagan and Ishizaka, 1976). However, as
Sondhi and Schroeter (1987) pointed out, the Flanagan and Cherry (1968) model did not
produce satisfactory unvoiced sounds because the “back” cavity impedance was too high.
Sondhi and Schroeter (1986, 1987) therefore modified the above distributed and series
pressure noise source model into a parallel flow source, Un = Pn/Rn, located
downstream from the constriction. Pn is given by

$$P_n = \begin{cases} \mathrm{turbg} \cdot \mathrm{rand} \cdot \left(R_e^2 - R_{ec}^2\right), & R_e > R_{ec} \\ 0, & R_e \le R_{ec} \end{cases} \tag{2.27}$$

where turbg is the empirically determined turbulence gain, and rand is a random
number uniformly distributed between -0.5 and 0.5. A first-order IIR filter with a cutoff
frequency of 2000 Hz was used to lowpass the flow. Figures 2-21(a) and (b) show the
equivalent circuits of the serial and parallel turbulence sources, respectively.
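
The thresholded noise source and the first-order lowpass can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the noise amplitude is taken as proportional to the difference of the squared Reynolds numbers, the Reynolds number is supplied by the caller rather than computed from the flow, and the one-pole filter design (coefficient exp(-2 pi fc / fs)) is a common textbook choice rather than the filter of Sondhi and Schroeter. All function names are hypothetical.

```python
import math
import random


def noise_pressure(re, re_crit, turb_gain, rng=random):
    """Random noise pressure that switches on above the critical Reynolds number.

    Returns turb_gain * rand * (re^2 - re_crit^2) for re > re_crit, else 0,
    with rand drawn uniformly from [-0.5, 0.5].
    """
    if re <= re_crit:
        return 0.0
    rand = rng.uniform(-0.5, 0.5)
    return turb_gain * rand * (re * re - re_crit * re_crit)


def one_pole_lowpass(x, fc, fs):
    """First-order IIR lowpass with unity DC gain (assumed design)."""
    a = math.exp(-2.0 * math.pi * fc / fs)
    y, state = [], 0.0
    for sample in x:
        state = (1.0 - a) * sample + a * state
        y.append(state)
    return y
```

In a synthesizer loop, one noise sample would be drawn per time step from the current Reynolds number at the constriction, and the resulting sequence would then be smoothed by the lowpass filter.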

We  adopt  the  turbulence  noise  source  model  from  Sondhi  and  Schroeter  (1986, 
1987).  However,  our  model  allows  the  user  to  place  the  turbulence  noise  source  at  the 
center  of,  or  immediately  downstream  or  upstream  from  the  constriction  region,  or 
spatially  distributed  along  the  constriction  region.  The  turbulence  gain  and  critical 
Reynolds  number  can  also  be  specified. 

We  have  considered  an  acoustic  model  of  the  human  vocal  system.  Now,  we  can 
construct  a transmission-line  circuit  model  for  the  vocal  system.  Figure  2-22  illustrates 
the  model  structure  of  the  vocal  system  for  the  proposed  articulatory  synthesizer.  Based 
on  this  structure,  the  acoustic  transfer  function  for  different  characteristics  of  the  vocal 
system can be evaluated. The most important purpose of this model structure is for
deriving the acoustic equations for synthesizing speech. Refer to Appendices B and C for
acoustic transfer function calculations and the derivation of the acoustic equations,
respectively.


2.4 Analysis of Various Vocal System Characteristics

In  this  section,  we  analyze  the  effects  of  various  vocal  system  characteristics.  This 
analysis  provides  the  basis  of  selecting  the  appropriate  vocal  system  model  structure  and 
component  values  for  the  articulatory  synthesizer.  Five  American  vowels  and  diphthongs 
(/a, i, u, eI, ou/) are investigated under different vocal characteristics. The vocal tract
cross-sectional  areas  for  these  vowels  and  diphthongs  are  given  in  Appendix  A,  while  the 
methods  for  calculating  the  acoustic  transfer  function  are  given  in  Appendix  B. 

2.4.1 Frequency-Dependent Components

As mentioned in section 2.3.1.3, the series resistance R and the shunt conductance
G in  the  equivalent  circuit  representation  of  a lossy  section  are  frequency  dependent.  In 
the  time-domain  approach  for  the  articulatory  synthesizer,  the  frequency-dependent 
components  have  to  be  simulated  at  a fixed  frequency.  The  effects  of  using  a fixed 
frequency  for  these  two  components  on  the  acoustic  transfer  function  are  not  well 
documented.  Wakita  and  Fant  (1978)  illustrated  the  effects  on  formant  frequencies  and 
bandwidths  of  five  Russian  vowels  when  the  frequency  was  fixed  at  1 kHz.  They 
concluded that the formant frequencies are scarcely affected, but that the bandwidths are
affected  rather  appreciably.  Figure  2-23  illustrates  the  acoustic  transfer  functions  of  five 
vowels  for  three  fixed  frequencies:  1 kHz,  2.5  kHz,  and  4 kHz.  The  acoustic  transfer 
functions  are  calculated  with  the  glottis  closed  and  no  nasal  tract  coupling.  The  formants 
are  not  affected  appreciably,  which  agrees  with  Wakita  and  Fant’s  (1978)  result.  For 
formant frequencies below 1.5 kHz, the formant bandwidths for the fixed-frequency cases
are wider than in the frequency-dependent case. For formant frequencies above 1.5 kHz, the
1 kHz case has the narrowest formant bandwidths, which are even narrower than in the
frequency-dependent case. As a trade-off, we set the frequency at 2.5 kHz for the
frequency-dependent components.

2.4.2 Number of Vocal Tract Sections

As described in section 2.3.1.2, the vocal tract can be approximated by a
concatenation  of  uniform  elemental  sections.  The  number  of  elemental  sections,  SN,  has 
to  be  large  enough  in  order  for  the  acoustic  characteristics  of  the  concatenated  tubes  to  be 
indistinguishable  from  the  continuous  ones.  In  this  section,  we  investigate  the  influence 
of  spatial  sampling  on  the  acoustic  transfer  function  of  the  vocal  tract.  Figure  2-24  shows 
the  acoustic  transfer  functions  of  different  numbers  of  vocal  tract  sections  for  five  vowels. 
The  acoustic  transfer  functions  are  calculated  with  the  glottis  closed  and  no  nasal  tract 
coupling.  It  is  seen  that  the  formant  frequencies  shift  upwards  when  the  spatial  sampling 
interval  increases,  except  for  the  second  and  third  formants  for  the  vowel  /i/,  where  the 
formants  shift  downward.  From  Figure  2-24,  we  can  see  that  the  spatial  sampling 
interval,  i.e.,  number  of  elemental  sections,  has  a more  significant  effect  on  the  acoustic 
transfer functions of vowels /u/ and /ou/ than on the others. However, a ten-section
cross-sectional area function is not enough to represent the acoustic characteristics of the
vocal tract.
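
The influence of the number of sections can be explored with a small chain-matrix calculation. The Python sketch below is a simplified stand-in for the method of Appendix B, not a reproduction of it: it assumes lossless sections of equal length, a closed glottis, and an ideally open lip end (zero radiation load), and the constants and function names are illustrative. Under these assumptions the resonances are the frequencies at which element D of the glottis-to-lips chain matrix crosses zero.

```python
import math

C_SOUND = 35000.0  # assumed speed of sound, cm/s
RHO = 1.14e-3      # assumed air density, g/cm^3


def chain_d(areas, length_cm, f_hz):
    """Element D of the chain matrix of concatenated lossless sections.

    areas lists the section areas (cm^2) from glottis to lips; all sections
    share the same length. With zero pressure at the lips,
    U_glottis = D * U_lips, so resonances occur where D = 0.
    """
    k = 2.0 * math.pi * f_hz / C_SOUND
    seg = length_cm / len(areas)
    cs, sn = math.cos(k * seg), math.sin(k * seg)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas:
        z0 = RHO * C_SOUND / area  # characteristic impedance of the section
        sb, sc = 1j * z0 * sn, 1j * sn / z0
        a, b, c, d = a * cs + b * sc, a * sb + b * cs, c * cs + d * sc, c * sb + d * cs
    return complex(d).real  # D is real for a lossless tube


def resonances(areas, length_cm, f_max=5000.0, step=1.0):
    """Locate zero crossings of D by scanning frequency in fixed steps."""
    found = []
    f_prev, d_prev = step, chain_d(areas, length_cm, step)
    for i in range(2, int(f_max / step) + 1):
        f_cur = i * step
        d_cur = chain_d(areas, length_cm, f_cur)
        if d_prev * d_cur < 0.0:
            # Linear interpolation between the two bracketing samples.
            found.append(f_prev + step * d_prev / (d_prev - d_cur))
        f_prev, d_prev = f_cur, d_cur
    return found
```

Calling `resonances` on the same area function sampled with, say, ten and sixty sections gives a direct view of how the formant estimates move with the spatial sampling interval. For a uniform tube the section count makes no difference, which serves as a sanity check.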

2.4.3  Nasal  Tract  System 

To  study  the  acoustic  properties  of  the  nasal  tract,  the  acoustic  transfer  functions  of 
the  nasal  tract  with  different  opening  areas  of  the  velopharyngeal  port  are  calculated. 
Figure  2-25  shows  the  resonant  characteristics  of  the  nasal  tract  with  various 
velopharyngeal  port  opening  areas.  Basically,  the  nasal  cavity  has  three  resonant 
frequencies.  The  first  resonance  is  not  affected  by  the  velopharyngeal  port  opening  area. 
However,  the  second  and  third  resonances  shift  downward  when  the  velopharyngeal  port 
opening  area  increases. 

Figure 2-25: The effect of velopharyngeal port opening area on the nasal tract acoustic transfer function.

Figure 2-26: The effect of extra sinus cavities on the nasal tract acoustic transfer function.

The effect of the extra sinus cavities on the acoustic transfer function of the nasal
tract  can  be  studied  from  Figure  2-26.  The  maxillary  sinus  and  the  frontal  sinus  are 
located  at  4 cm  and  8 cm  positions,  respectively,  from  the  nostrils.  The  two  sinus  cavities 
are tuned to 500 Hz and 1400 Hz, respectively. It is seen that the frontal sinus has a
limited effect on the nasal tract acoustic transfer function, where it causes a zero-pole pair
in the vicinity of its resonance frequency. This is the reason why Maeda (1982b) ignored
this sinus without losing the essentials of the nasal tract. The first resonance of the nasal
tract without sinuses is shifted downward, with a lower peak level, as a result of the maxillary
sinus coupling. The maxillary sinus also brings about a pole-zero pair in the vicinity of its
resonance frequency.

Lowering  the  velum  creates  a side  passage  for  the  air  flow  through  the  nasal 
cavity,  giving  rise  to  complex  modifications  of  the  acoustic  characteristics  of  the  sound. 
Figure  2-27  shows  how  this  mechanism  affects  the  acoustic  transfer  function  of  the  vocal 
system. The velopharyngeal opening area is 0.5 cm². For the nasal tract with sinus
cavity, only the maxillary sinus is included and is tuned to 500 Hz. It is well known that
the parallel branching of the nasal tract at the velum introduces antiresonances into the vocal
tract acoustic transfer function. The antiresonances in the vicinity of 1 kHz and 3 kHz
can be seen clearly for some vowels. The effect of an extra sinus on the acoustic transfer
functions for vowels /a, eI/ is more significant than for vowels /i, u, ou/. This result
supports Maeda’s statement (1982b) that the high vowels, such as /i/ and /u/, are more
nasalized than the middle and low vowels, such as /a/ and /a/, even when the nasal sinus is
not included.

2.4.4 Glottal Impedance and Subglottal System

The  influence  of  the  glottal  impedance  and  the  subglottal  system  on  the  acoustic 
transfer  function  of  the  vocal  system  can  be  studied  from  Figure  2-28.  When  the  glottal 
area  is  small,  i.e.,  the  glottal  impedance  is  relatively  high,  the  influence  is  insignificant. 
For a large glottis, the increased loading of the vocal tract causes an increase in the
bandwidths  and  to  some  extent  also  in  the  formant  frequencies.  It  is  obvious  that  the 
influence  of  the  subglottal  system  depends  on  the  glottal  impedance.  When  the  glottal 
area  is  small,  the  influence  of  the  subglottal  resonances  is  small,  and  vice  versa. 

2.4.5 Excitation in the Vocal Tract

Placing  the  excitation  source  in  the  vocal  tract  results  in  a very  complicated 
system.  This  section  examines  the  effects  of  relocating  the  excitation  on  the  acoustic 
transfer function (see Figure 2-29). It is found that the vocal tract resonant frequencies
are relatively unaffected when the excitation is moved forward from the pharynx to the
front oral cavity. Moving the excitation does, however, introduce antiresonances into the
acoustic transfer function, and the number of antiresonances increases as the excitation is
moved forward from the pharynx to the front oral cavity. The excitation source
waveform can be obtained by deconvolving the modified acoustic transfer function
from the speech signal. It is expected that such an excitation waveform differs from the glottal
waveform, which generally contains no zeros. Because different vowels have different
antiresonances, owing to their different vocal tract shapes, modeling the
excitation waveform inside the vocal tract is difficult, if not impossible. One possible way
to generate the excitation waveform inside the vocal tract is to prefilter the glottal pulse
with the inverse filter of the modified acoustic transfer function.

2.5  Articulatory  Synthesizers 

Basically,  there  are  three  approaches  used  in  articulatory  speech  synthesis.  The 
wave  digital  filter  approach  (Fettweis  and  Meerkotter,  1975;  Lawson  and  Mirzai,  1990) 
extends  the  Kelly-Lochbaum  model  (1962).  This  approach  is  based  on  forward  and 
backward  traveling  waves  in  a lossless  acoustic  tube  (Titze,  1973;  Rubin  et  al.,  1981; 
Strube,  1982;  Meyer  et  al.,  1989),  and  can  be  realized  for  real-time  synthesis  (Meyer  et 
al.,  1989).  It  usually  omits  many  acoustic  effects,  such  as  proper  handling  of  all  existing 
losses in the tracts, realistic modeling of the glottis, and appropriate modeling of the
source-tract interaction. Also, the vocal tract length cannot be varied easily since the
length  of  each  section  is  fixed  and  related  to  the  sampling  frequency  (Wakita,  1973). 
Recently,  some  progress  has  been  made  in  modeling  voiceless  excitation,  damping,  and 
the  glottal  excitation  (Meyer  et  al.,  1989).  The  dynamic  variation  of  vocal  tract  length 
can  be  simulated  by  varying  the  sampling  rate  (Wright  and  Owens,  1993). 

The  second  approach  uses  a hybrid  time-frequency  domain  method,  which  models 
the  highly  nonlinear  glottal  characteristics  in  the  time  domain  and  the  linear  tract  with 
frequency-dependent  losses  and  wall  vibration  characteristics  in  the  frequency  domain 
(Allen  and  Strong,  1985;  Sondhi  and  Schroeter,  1986,  1987).  The  tract  filter  function  and 
glottal  source  excitation  function  are  interfaced  by  an  inverse  Fourier  transformation  and 
digital  convolution.  The  problems  with  this  approach  are  that  it  is  incapable  of  producing 
the  dynamic  transitions  of  certain  phonemes,  e.g.,  plosives,  and  it  needs  additional  care  to 
cope  with  the  interaction  between  voiced  and  voiceless  sources  (Lin,  1990).  In  addition, 
it  does  not  calculate  the  pressure  and  volume  velocity. 

The third approach models the human vocal system as a large set of linear or
nonlinear difference equations that are solved in each sampling interval to give samples of
the  pressure  and  volume  velocity  at  each  point  in  the  transmission-line  circuit  (Flanagan 
and  Cherry,  1968;  Flanagan  and  Landgraf,  1968;  Flanagan  and  Ishizaka,  1976;  Flanagan 
et  al.,  1975,  1980).  The  values  of  pressure  and  volume  velocity  at  one  time  instant  are 
used  to  determine  the  losses  for  the  next  time  interval.  This  approach  has  been  referred  to 
as  the  time-domain  approach  (Sondhi  and  Schroeter,  1987).  Figure  2-30  shows  the 
schematic  diagram  of  this  approach. 

In the time-domain approach, a very high sampling rate is usually required to
avoid frequency-warping distortion (Wakita and Fant, 1978). In addition, the
frequency-dependent components are simulated at a fixed frequency (see section 2.4.1).
Natural-sounding speech, however, can be generated. Several advantages have made the
time-domain approach popular, although its computation is cumbersome. These
advantages are that the aerodynamic interaction is inherently included, the pressure and
volume velocity at any point can be computed, and the dynamic articulatory gestures can
be obtained when combined with the articulatory model. In our study, the time-domain
approach is used for realization of the acoustic model.

Figure 2-30: Time-domain approach of articulatory synthesis.

Maeda  (1982a)  simplified  Flanagan’s  model  by  replacing  the  mechanical  vibration 
model  of  vocal  cords  with  glottal  area  control  parameters,  by  discarding  noise  sources 
within  the  vocal  tract,  and  by  omitting  the  effects  of  the  nasal  sinuses.  These 
simplifications made the synthesis much faster. Bocchieri (1983) and Bocchieri and
Childers (1984) introduced further simplifications: they reduced the number of noise
sources and modeled the vocal tract variation by drawing a sequence of midsagittal vocal
tract outlines on a graphic terminal. Based on Maeda’s (1982a) work, Childers and Ding
(1991)  implemented  an  articulatory  speech  synthesizer  by  using  a discrete  circuit  model 
that  converts  the  acoustic  equations  into  linear  algebraic  equations. 

We  rederived  the  acoustic  equations  (see  Appendix  C)  of  the  vocal  system  to 
include  the  subglottal  system,  the  glottal  impedance,  the  turbulence  noise  source,  and  the 
sinus cavities. Table 2-5 compares the proposed articulatory synthesizer (Figure 2-22)
with other main articulatory synthesizers.

2.5.1  Realization 

Figure 2-31 shows a software block diagram of the model constructed for
time-domain articulatory synthesis. Two options are provided for the interpolation of the
vocal tract configuration: vocal tract cross-sectional area and articulatory parameters. If
the articulatory parameters are interpolated, the articulatory model is used to transform
the parameters to the vocal tract cross-sectional area. The default number of vocal tract
sections is 60. The spatial resolution conversion optionally reduces the number of vocal
tract sections to 30, 20, 15, 12, or 10. Then, the vocal tract cross-sectional
area is transformed to the equivalent RLC network. Meanwhile, the excitation
parameters are interpolated and the excitation waveform is generated, according to the
interpolated parameters, as the source input to the circuit network. The nasal sinus
cavities and/or the subglottal system can be included optionally in the circuit network. By
applying Kirchhoff’s and Ohm’s laws and the trapezoidal algorithm, the discrete-time
acoustic matrix equations are formed (see Appendix C for the details). The pressure at the
midpoint of each section and the volume velocity at the junction of adjacent sections are
calculated as solutions using the elimination procedure and a backward substitution. The
synthetic speech is the backward difference between the sum of the volume velocities at
the nostrils and lips at the current time instant and the sum of the volume velocities at the
nostrils and lips at the previous time instant. The synthesis procedure is repeated by
refreshing the force constants (see Appendix C), updating the time instant, and advancing
the target and/or excitation frames until the time epoch is reached.

Table 2-5: Comparison of several articulatory synthesizers.

Figure 2-31: The software block diagram of the proposed articulatory synthesizer.
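
The output stage described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ARTM implementation: the function name `radiated_speech` and its arguments are hypothetical, the volume velocities are assumed to be already available as equal-length sample sequences, and the first output sample is set to zero because no previous time instant exists for it.

```python
def radiated_speech(u_lips, u_nostrils):
    """Backward difference of the summed lip and nostril volume velocities.

    u_lips, u_nostrils -- equal-length sequences of volume-velocity samples
                          (hypothetical inputs, one value per time instant).
    Returns s[n] = u[n] - u[n-1] with u = u_lips + u_nostrils.
    """
    total = [ul + un for ul, un in zip(u_lips, u_nostrils)]
    if not total:
        return []
    speech = [0.0]  # assumed initial condition: no previous sample at n = 0
    for n in range(1, len(total)):
        speech.append(total[n] - total[n - 1])
    return speech
```

For example, lip samples `[0.0, 1.0, 3.0]` and nostril samples `[0.0, 0.5, 0.5]` sum to `[0.0, 1.5, 3.5]`, so the sketch returns `[0.0, 1.5, 2.0]`.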

A sketch  of  the  configuration  and  excitation  parameter  interpolations  is  illustrated 
in  Figure  2-32.  Only  two  target  frames  and  two  excitation  frames  are  shown  in  this 
figure.  Either  the  vocal  tract  cross-sectional  areas  or  the  articulatory  parameters  are 
interpolated  between  the  current  target  frame  i and  the  next  target  frame  i+1  during  the 
synthesis  of  speech.  Notice  that  the  target  frames  have  to  be  converted  to  the  specified 
spatial  resolution,  as  mentioned  in  the  previous  paragraph.  Assume  that  the  converted 
target frame structures are pointed to by pointers Tfsyn1 and Tfsyn2, respectively. A
temporary pointer, Ttemp, is used to point to the interpolated structure configuration.
Similarly, a temporary pointer, tempex, is used to point to the interpolated excitation
structure. Both data structures are used to generate speech.

2.5.2 Interpolation Functions


Two  interpolation  functions,  linear  and  arctan,  are  provided  to  interpolate  the 
vocal  tract  configuration:  1)  vocal  tract  cross-sectional  area  or  2)  articulatory  parameters. 
For  any  one  articulatory  parameter  or  any  one  vocal  tract  cross-sectional  area,  linear 
interpolation  can  be  described  through  the  following  function: 

$$ y = \alpha + \beta \cdot t \tag{2.28} $$

where $y$ is the interpolated articulatory parameter or vocal tract cross-sectional area, $t$ is
time, and $\alpha$ and $\beta$ are interpolation parameters. The two interpolation
parameters are determined by the two target frames, $i$ and $i+1$.


The arctan interpolation function is defined as

$$ y = \alpha + \beta \cdot \arctan[\gamma (t - t_0)] \tag{2.29} $$

where $\alpha$ and $\beta$ are interpolation parameters that are determined by the two target
frames, $i$ and $i+1$, $\gamma$ is the rate of transition, and $t_0$ is the point of transition (Gupta and
Schroeter, 1993). For a one-dimensional case, let $(t_i, y_i)$ and $(t_{i+1}, y_{i+1})$ denote the time
and the corresponding parameter values of the two target frames, $i$ and $i+1$, respectively.
Then, from equation (2.29), we obtain

$$ \beta = \frac{y_i - y_{i+1}}{\arctan[\gamma (t_i - t_0)] - \arctan[\gamma (t_{i+1} - t_0)]} \tag{2.30} $$

$$ \alpha = y_i - \beta \cdot \arctan[\gamma (t_i - t_0)] \tag{2.31} $$

Figure 2-33 illustrates the arctan interpolation function for different values of $\gamma$, together
with the linear function. We implemented a popup window to specify the start transition time, the
transition point, and the transition rate (see Chapter 4, section 4.4).
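
Equations (2.28) through (2.31) translate directly into code. The following Python sketch is illustrative only: the function names and the example frame values are hypothetical, and the two target frames are assumed to have distinct times.

```python
import math


def linear_coeffs(t_i, y_i, t_next, y_next):
    """Return (alpha, beta) so that y = alpha + beta * t passes through both frames."""
    beta = (y_next - y_i) / (t_next - t_i)
    alpha = y_i - beta * t_i
    return alpha, beta


def arctan_coeffs(t_i, y_i, t_next, y_next, gamma, t0):
    """Return (alpha, beta) from equations (2.30) and (2.31)."""
    denom = math.atan(gamma * (t_i - t0)) - math.atan(gamma * (t_next - t0))
    beta = (y_i - y_next) / denom
    alpha = y_i - beta * math.atan(gamma * (t_i - t0))
    return alpha, beta


def arctan_interp(t, alpha, beta, gamma, t0):
    """Evaluate equation (2.29) at time t."""
    return alpha + beta * math.atan(gamma * (t - t0))


# Hypothetical target frames: y = 1.0 at t = 0 s and y = 3.0 at t = 0.1 s,
# with the transition point placed midway between them.
alpha, beta = arctan_coeffs(0.0, 1.0, 0.1, 3.0, gamma=100.0, t0=0.05)
midpoint_value = arctan_interp(0.05, alpha, beta, gamma=100.0, t0=0.05)
```

With the transition point midway between the frames, the arctan curve passes through both targets and through their mean at $t_0$; a larger $\gamma$ concentrates the change in a shorter interval around $t_0$.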


2.6  Summary 

This  chapter  has  focused  on  four  main  areas  of  the  articulatory  synthesizer  model: 
the articulatory model, the acoustic model, the analysis of various vocal system
characteristics, and the articulatory synthesizer. After a brief review of the articulatory
model,  we  defined  the  articulatory  parameters  and  described  the  articulatory  model  in 
some  detail.  Our  articulatory  model  represents  the  vocal  tract  by  as  many  as  60  sections 
to  provide  more  reliable  estimates  of  the  cross-sectional  areas.  We  have  made  an  attempt 
to  cover  the  acoustic  model  of  the  entire  vocal  system,  which  includes  the  vocal  tract,  the 
nasal  tract,  the  sinuses,  the  glottal  impedance,  the  subglottal  tract,  the  glottal  excitation 
source,  and  the  turbulence  noise  source.  A transmission-line  circuit  model  of  the  vocal 
system was constructed. Also included in this chapter was the analysis of several
characteristics of the vocal system that are based on the calculation of the acoustic transfer
function. This analysis has provided a basis for choosing appropriate parameters for
the  articulatory  synthesizer.  The  effect  of  relocating  the  excitation  source  on  the  acoustic 
transfer  function  was  also  described.  Finally,  the  strategy  of  the  implementation  of  the 
articulatory  synthesizer  was  presented  after  a review  of  the  articulatory  synthesis 
approaches.  The  time-domain  approach  was  implemented  to  provide  the  ability  to 
investigate  the  dynamic  properties  of  the  vocal  system.  Two  types  of  interpolation 
functions,  linear  and  arctan,  were  used  to  interpolate  the  vocal  tract  cross-sectional  area  or 
articulatory  parameters. 


CHAPTER  3 

SPEECH  INVERSE  FILTERING 


The  recovery  of  articulatory  movements  from  the  speech  signal,  known  as  the 
speech  inverse  filtering  problem,  is  difficult  due  to  the  non-uniqueness  of  the  solution. 
This  problem  has  been  the  subject  of  research  for  several  applications,  including 
articulatory  synthesis,  speech  recognition,  low-bit-rate  speech  coding,  and  text-to-speech 
synthesis. In this chapter, we attempt a new solution that treats speech inverse filtering as a
constrained multidimensional nonlinear optimization problem and solves it with the
simulated annealing algorithm.
The  coordinates  of  the  jaw,  tongue  body,  tongue  tip,  lips,  velum,  and  hyoid  compose  the 
multidimensional  articulatory  vector.  A comparison  between  the  model-derived  and  the 
target-frame  first  four  formant  frequencies  forms  the  cost  function.  There  are  two 
constraints:  (1)  the  articulatory-to-acoustic  transformation  function,  and  (2)  the  boundary 
conditions  for  the  articulatory  parameters.  The  optimum  articulatory  vector  is  obtained 
by  finding  the  minimum  cost  function.  Once  the  optimum  articulatory  vector  is 
determined,  the  articulatory  model  determines  the  vocal  tract  cross-sectional  area  function 
which  in  turn  is  used  by  the  articulatory  speech  synthesizer. 
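
The structure of this optimization problem can be illustrated with a small Python sketch. The precise form of the cost function is not given in this overview, so the sum of squared relative formant errors below is an assumed stand-in, as is the choice to treat an out-of-bounds vector as infeasible. The function name, its arguments, and the `model` callable (a placeholder for the articulatory-to-acoustic transformation) are all hypothetical.

```python
import math


def formant_cost(params, target_formants, model, lower, upper):
    """Hypothetical cost: squared relative error over the first four formants.

    params          -- articulatory vector (list of floats)
    target_formants -- target-frame formant frequencies in Hz (at least four)
    model           -- callable mapping params to model formants in Hz; stands in
                       for the articulatory-to-acoustic transformation
    lower, upper    -- per-parameter bounds on the articulatory parameters
    """
    # Constraint (2): a vector outside its bounds is treated as infeasible.
    for p, lo, hi in zip(params, lower, upper):
        if not lo <= p <= hi:
            return math.inf
    # Constraint (1): the formants must come from the articulatory-to-acoustic model.
    model_formants = model(params)
    return sum(((fm - ft) / ft) ** 2
               for fm, ft in zip(model_formants[:4], target_formants[:4]))
```

A relative error is used here only so that all four formants contribute on a comparable scale; an absolute or weighted error would fit the same interface.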

3.1 Review of the Derivations of the Vocal Tract Area Function

Geometric  data  concerning  the  vocal  tract  is  essential  to  our  understanding  of 
articulation,  and  is  a key  factor  in  speech  production.  The  acoustical  theory  of  speech 
production  (Fant,  1960)  views  the  vocal  tract  as  an  acoustical  tube  with  a varying 
cross-sectional  area.  The  success  of  articulatory  modeling  depends  to  a large  extent  on  the 
accuracy  with  which  the  vocal  tract  cross-sectional  area  function,  A(x),  can  be  specified 
for a particular utterance. Measurement of the vocal tract geometry is difficult. Basically,
there are two methods for obtaining the vocal tract cross-sectional area function: (1) direct
measurements from images such as X-rays and (2) estimation of the area function from
acoustic data.

3.1.1  Direct  Measurements 

Direct  measurements  of  the  vocal  tract  have  been  made  from  lateral  X-ray  images 
(e.g.,  Chiba  and  Kajiyama,  1941;  Fant,  1960;  Johansson  et  al.,  1983).  Unfortunately, 
these  direct  measurements  and  their  evaluations  are  laborious.  In  addition,  the  exposure 
to  X-ray  for  utterances  of  long  durations  is  a problem  owing  to  dosage  limitations. 
Magnetic  resonance  imaging  (MRI)  (Baer  et  al.,  1991),  which  is  free  from  the 
disadvantages  associated  with  X-ray  methods,  might  appear  to  be  the  best  available 
method  to  collect  the  necessary  data.  The  drawback,  however,  is  that  the  subject  may 
fatigue  since  the  imaging  process  requires  a long  time.  Additional  drawbacks  stem  from 
the  fact  that  the  resolution  of  air-tissue  boundaries  may  depend  on  the  thickness  of  the 
tissue  section,  and  the  calcified  structures  contain  little  mobile  hydrogen  and,  thus,  may  be 
indistinguishable  from  the  airway. 

3.1.2  Estimation  from  Acoustic  Data 

Several  researchers  have  proposed  analytical  methods  to  derive  the  vocal  tract 
cross-sectional  area  function,  A(x),  from  acoustic  data.  Two  approaches  are  based  on 
LPC  and  the  tube  impulse  response,  respectively.  The  LPC  approach  is  based  on  the  fact 
that  the  filtering  process  of  the  lossless  nonuniform  acoustic  tube  model  of  the  vocal  tract 
is  identical  to  that  of  the  optimal  inverse  filter  model  for  proper  boundary  conditions  at 
the  glottis  and  the  lips  (Atal  and  Hanauer,  1971;  Wakita,  1973,  1979;  Wakita  and  Gray, 
1975).  The  reflection  coefficients  are  extracted  by  inverse  filtering  the  speech  signal. 
Then,  the  vocal  tract  cross-sectional  area  function  can  be  obtained  from  the  set  of 
reflection coefficients. The main problem with this approach is articulatory
compensation, or the “ventriloquist effect,” i.e., the fact that different vocal tract shapes
can produce the same formant frequencies (Schroeder, 1967; Mermelstein, 1967; Atal et
al., 1978; Schroeder and Strube, 1979; Bonder, 1983; Charpentier, 1984). As to the tube
impulse response approach, the basic concept is that, if the transfer function of the vocal
tract is known, then A(x) can be derived uniquely (Schroeder, 1967; Paige and Zue,
1970; Gopinath and Sondhi, 1970; Sondhi, 1979; Sondhi and Resnick, 1983; Milenkovic,
1984, 1987). However, measuring the transfer function of the vocal tract requires
impedance tubes with externally generated excitation, which does not allow the subject to
phonate.
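
The last step of the LPC approach, converting reflection coefficients into an area function, can be illustrated as follows. The sketch assumes one common sign convention, k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i); other authors use the opposite sign or number the sections from the other end of the tract, so this is not necessarily the convention of the works cited above. The function names are hypothetical. Because the coefficients fix only the ratios of adjacent areas, the first area is an arbitrary reference value.

```python
def areas_from_reflection(k, first_area=1.0):
    """Rebuild a relative area function from junction reflection coefficients.

    Assumed convention: k[i] = (A[i+1] - A[i]) / (A[i+1] + A[i]) with |k[i]| < 1,
    which gives A[i+1] = A[i] * (1 + k[i]) / (1 - k[i]).
    """
    areas = [first_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 + ki) / (1.0 - ki))
    return areas


def reflection_from_areas(areas):
    """Inverse mapping under the same assumed convention."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]
```

Under this convention, a junction where the area triples has k = 0.5, and applying the two functions in sequence recovers the original areas up to the chosen reference value.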

To  finesse  some  of  the  difficulties  of  the  analytical  methods,  the  “sorting”  and 
“codebook”  methods  perform  sampling  of  the  articulatory  parameters  from  the 
articulatory  model  and  establish  tables  of  vocal  tract  shapes  and  related  acoustical 
representations.  For  the  sorting  method,  reference  tables  are  established  by  covering  the 
articulatory  space  with  a uniform  or  non-uniform  grid  and  storing  the  acoustic  values 
computed  at  every  vertex  of  the  grid  (Atal  et  al.,  1978;  Charpentier,  1984;  Cook,  1991). 
These  tables  can  be  used  to  look  up  the  effective  vocal-tract  geometric  representations  that 
have  similar  acoustic  features.  Some  refinements,  such  as  singular  value  decomposition 
and  local  region  linearization  (Atal  et  al.,  1978),  have  been  used  to  solve  the  ambiguous 
geometric  subspace.  On  the  other  hand,  the  codebook  method  samples  the  articulatory 
space  randomly  and  prunes  it  to  retain  only  the  reasonable  shapes  in  the  codebook.  This 
method  provides  the  basis  for  the  vector  quantization  of  the  articulatory  space  (Larar  et 
al., 1988). Schroeter et al. (1990) improved the generation
of codebooks by using a dynamic programming search. The codebooks are accessed
by evaluating a weighted cepstral distortion measure as given by Meyer et al. (1991).
There are several drawbacks with this numerical approach: cumbersome computations,
sensitivity to the source excitation, mapping ambiguities, and acoustic modeling
limitations (Schroeter and Sondhi, 1994).

A recent approach is to apply an artificial neural network (ANN) model to
speech inverse filtering, since it is a promising approach to implement codebooks. The
ANN  model  is  trained  with  a large  set  of  acoustic  parameter  patterns.  Then  a test  pattern 
of  acoustic  parameters  is  used  to  search  the  codebook  to  retrieve  a corresponding 
articulatory pattern parameter set (Xue et al., 1990; Båvegård and Högberg, 1992, 1993;
Papcun  et  al.,  1992;  Rahim  et  al.,  1993).  However,  the  learning  of  a large  set  of  training 
patterns  to  span  the  articulatory  space  is  still  a challenge  for  the  ANN  model  (Xue  et  al., 
1990).  In  addition,  as  Schroeter  and  Sondhi  (1994)  pointed  out,  no  clear  advantage  has  so 
far  been  shown  for  ANN  compared  to  other  approaches. 

The  feedback  methods  try  to  optimize  the  articulatory  parameters  that  are  adjusted 
until  the  synthetic  speech  features  differ  minimally  from  the  actual  speech  features.  The 
selected speech features can be formants (Prado, 1991; Prado et al., 1992), spectra
(Flanagan et al., 1980; Levinson and Schmidt, 1983; Parthasarathy and Coker, 1990,
1992; Gupta and Schroeter, 1991, 1993; Guo and Milenkovic, 1993), or others. The
optimization can be done on a phoneme-by-phoneme basis (Parthasarathy and Coker,
1990,  1992)  or  on  a frame-by-frame  basis  (Flanagan  et  al.,  1980;  Levinson  and  Schmidt, 
1983;  Gupta  and  Schroeter,  1991,  1993;  Prado,  1991;  Prado  et  al.,  1992).  Several  search 
algorithms  have  been  used,  such  as  the  Hooke  and  Jeeves  algorithm  (Flanagan  et  al., 
1980;  Parthasarathy  and  Coker,  1990,  1992;  Gupta  and  Schroeter,  1991,  1993),  the 
optimal  gradient  algorithm  (Levinson  and  Schmidt,  1983),  and  combinations  of  the 
modified  Fletcher-Reeves  method  and  linear  successive  approximation  (Prado,  1991; 
Prado et al., 1992). The problem of local minima, which arises from the nonlinearity of speech
inverse filtering, is a major impediment to this method.

Advances  in  computer  technology  have  allowed  the  solution  of  optimization 
problems  that  require  large  numbers  of  complicated  function  evaluations  to  be  computed 
on relatively inexpensive machines in a reasonable time. Thus, stochastic methods, such
as genetic algorithms, which are search procedures based on the mechanics of natural
selection and natural genetics (Goldberg, 1989), can be applied to the speech inverse
filtering problem. Some preliminary results have been obtained by McGowan (1994).
Articulatory  trajectories  of  an  articulatory  model  were  recovered  by  means  of  a genetic 
algorithm  from  the  first  three  formant  frequencies  using  a task-dynamic  model  (Saltzman, 
1986;  Saltzman  and  Kelso,  1987;  Saltzman  and  Munhall,  1989)  of  speech  articulation. 
Tests  on  synthesized  utterances  show  that  the  method  can  recover  the  major  aspects  of  an 
original trajectory, but it has trouble in obtaining the precise timing of events. An
additional difficulty for the genetic algorithm, as Goffe et al. (1994) experienced, is
that it needs further development to become more usable for
continuous function problems, since it has difficulty with a relatively flat surface.

In  general,  finding  the  global  minimum  value  of  a cost  function  with  many 
degrees  of  freedom  is  difficult,  since  the  cost  function  tends  to  have  many  local  minima. 
A procedure  for  solving  such  optimization  problems  should  sample  values  of  the  cost 
function  in  such  a way  as  to  have  a high  probability  of  finding  a near-optimal  solution  and 
should  also  lend  itself  to  efficient  implementation.  Over  the  past  few  years,  simulated 
annealing has emerged as a viable technique that meets these criteria. Simulated
annealing, which is modeled on processes found in nature, i.e., thermodynamics (Metropolis
et al., 1953; Kirkpatrick et al., 1983), is a stochastic optimization method. It explores the
function’s  entire  surface  and  tries  to  optimize  the  function  while  moving  uphill  and 
downhill.  Thus,  this  technique  is  largely  independent  of  the  starting  values,  which  is  often 
a critical  factor  in  conventional  optimization  algorithms.  Simulated  annealing  also  makes 
less  stringent  assumptions  regarding  the  function  than  do  conventional  algorithms.  For 
example,  the  function  need  not  be  continuous  since  the  method  does  not  require  the 
calculation  of  derivatives.  Because  of  these  relaxed  assumptions,  it  can  more  easily  deal 
with  functions  that  have  ridges  and  plateaus.  In  addition,  it  can  be  applied  to  optimize  a 
“black  box”  system  for  which  one  only  needs  to  define  the  state  (the  parameter  space)  and 
to compute the corresponding energy (cost function value). Finally, functions that are not
defined for some parameter values can also be optimized by the simulated annealing
method (Vanderbilt and Louie, 1984; Bohachevsky et al., 1986; Corana et al., 1987; Goffe
et al., 1992, 1994).

Based  on  the  above  reviews  and  discussions,  we  selected  the  simulated  annealing 
algorithm  for  optimizing  the  nonlinear  acoustic-to-articulatory  transformation,  i.e.,  speech 
inverse  filtering. 


3.2  Simulated  Annealing  Algorithms 

Simulated  annealing  was  first  derived  from  statistical  mechanics,  where  the 
thermodynamic  properties  of  a large  system  in  thermal  equilibrium  at  a given  temperature 
were  studied  (Metropolis  et  al.,  1953).  A description  of  the  physical  annealing  process 
inspired this algorithm. In physical annealing, a solid metal is first melted at a high
temperature. After slow cooling (annealing), the molten metal arrives at a low energy
state,  since  careful  cooling  brings  the  material  to  a highly  ordered,  crystalline  state. 
Inherent  random  fluctuations  in  energy  allow  the  annealing  system  to  escape  local  energy 
minima  to  achieve  the  global  minimum.  However,  if  the  material  is  cooled  very  quickly 
(or  ‘quenched’),  it  might  not  escape  local  energy  minima  and  when  fully  cooled  it  may 
contain  more  energy  than  annealed  metal.  Simulated  annealing  attempts  to  minimize  an 
analogue  of  energy  in  an  annealing  process  to  find  the  global  minimum.  Kirkpatrick  et  al. 
(1983)  were  the  first  to  propose  and  demonstrate  the  application  of  simulated  annealing 
techniques  to  problems  of  combinatorial  optimization,  specifically  to  the  problems  of  wire 
routing  and  component  placement  in  VLSI  design.  Both  Vanderbilt  and  Louie  (1984)  and 
Bohachevsky  et  al.  (1986)  have  modified  simulated  annealing  for  continuous  variable 
problems. However, the Corana et al. (1987) implementation of simulated annealing for
continuous variable problems appears to offer the best combination of ease of use and
robustness, so it is used for our optimization process.

3.2.1 Origin of the Algorithm


As far back as 1953, Metropolis et al. (1953) proposed a method for computing the
equilibrium distribution of a set of particles in a “heat bath” using a computer simulation
method. For the system in thermal equilibrium at a given temperature $T$, they assumed
that the probability $\pi_T(c)$ that the system is in a given configuration $c$ depends upon the
energy $E(c)$ of the configuration and follows the Boltzmann distribution:

$$ \pi_T(c) = \frac{e^{-E(c)/kT}}{\sum_{s \in C} e^{-E(s)/kT}} \tag{3.1} $$

where $k$ is Boltzmann’s constant and $C$ is the set of all possible configurations. The
configuration of the system is identified with the set of spatial positions of the particles.
A stochastic relaxation technique was developed to simulate the behavior of the system.
Suppose that the system is in configuration $C_t$ at time $t$. A candidate configuration $C_n$ for
the system at time $t+1$ is generated randomly. The criterion for selecting or rejecting
configuration $C_n$ as a new configuration (state) depends on the difference of energies
between configuration $C_n$ and configuration $C_t$. Define $p$, the ratio of the probability of
being in $C_n$ to the probability of being in $C_t$, as:

$$ p = \frac{\pi_T(C_n)}{\pi_T(C_t)} = e^{-(E(C_n) - E(C_t))/kT} \tag{3.2} $$

Then, apply a criterion, which has come to be known as the Metropolis criterion or
algorithm, to decide the acceptance of $C_n$. The Metropolis criterion can be stated as
follows: If $p > 1$, that is, the energy of $C_n$ is strictly less than the energy of $C_t$, then
configuration $C_n$ is automatically accepted as the new configuration for time $t+1$. If $p \le 1$,
that is, the energy of $C_n$ is greater than or equal to that of $C_t$, then configuration $C_n$ is
accepted as the new configuration with probability $p$. So a move to a state of higher
energy is accepted in a limited way. By repeating this process for a large enough number
of moves, that is, as $t \to \infty$, regardless of the starting configuration, it can be shown that
the distribution of configurations generated converges to the Boltzmann distribution
(Geman and Geman, 1984).
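
The convergence property can be checked numerically on a toy system. The following Python sketch is an illustration, not part of the original work: it assumes a small finite set of configurations with made-up energies and a symmetric proposal that picks any configuration with equal probability, applies the Metropolis criterion, and compares the visit frequencies with equation (3.1). All names are hypothetical.

```python
import math
import random


def metropolis_frequencies(energies, kT, n_moves, seed=0):
    """Run n_moves Metropolis moves over a finite set of configurations.

    energies -- energy E(c) of each configuration c (indexed 0..len-1)
    kT       -- Boltzmann's constant times the temperature
    Returns the relative visit frequency of each configuration.
    """
    rng = random.Random(seed)
    current = 0
    visits = [0] * len(energies)
    for _ in range(n_moves):
        candidate = rng.randrange(len(energies))  # symmetric random proposal
        delta = energies[candidate] - energies[current]
        # Metropolis criterion: always accept a lower energy (p > 1);
        # otherwise accept with probability p = exp(-delta / kT), eq. (3.2).
        if delta < 0 or rng.random() < math.exp(-delta / kT):
            current = candidate
        visits[current] += 1
    return [v / n_moves for v in visits]


def boltzmann(energies, kT):
    """Boltzmann probabilities of equation (3.1)."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]
```

For a long enough run, the frequencies returned by `metropolis_frequencies` should approach the values from `boltzmann`; the agreement is statistical, so it improves with the number of moves rather than being exact.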

3.2.2  The  Cooling  Schedule 

A fundamental  question  arises  in  statistical  mechanics  concerning  the  system  in 
the  limit  as  it  approaches  a low  temperature,  for  example,  whether  cooling  produces 
crystalline  or  glassy  solids  in  a metallurgic  process.  To  achieve  ground  state  (a 
low-energy  crystalline  configuration),  simply  lowering  the  temperature  is  not  sufficient. 
Rather,  a cooling  schedule  must  be  followed,  where  the  temperature  of  the  system  is 
elevated,  and  then  gradually  lowered,  spending  enough  time  at  each  temperature  to 
guarantee  that  thermodynamic  equilibrium  has  been  reached.  If  insufficient  time  is  spent 
at  each  temperature,  especially  at  a lower  temperature,  then  the  probability  of  achieving  a 
low-energy  crystalline  state  is  greatly  reduced. 

The application of the annealing process to optimization problems involves several
steps. First, one must identify the analogues of the physical concepts in the optimization
problem. The energy function becomes the cost function. The configuration of particles
becomes the combination of independent variable values. The rearrangement of particles
becomes the iterative improvement of function values by changing variable values.
Finding a low-energy configuration corresponds to finding a near-optimal solution, and the
temperature becomes the control parameter for the process. Second, one must have a way of
generating the candidate states. Usually, states are generated with a probability density
function g(x) that has a Gaussian-like peak. Third, one must have a way of selecting the
new state. A state acceptance probability allows occasional hill-climbing as well as
descents. The acceptance probability is based on the chances of obtaining a new state
relative to a previous state. Two acceptance probability equations have been used
successfully, those of the Boltzmann machine and the Metropolis algorithm, which are given by

Boltzmann machine:

$$ p(\Delta E) = \frac{1}{1 + e^{\Delta E / T}}, \quad \text{for all } \Delta E. \tag{3.3} $$

Metropolis algorithm:

$$ p(\Delta E) = \begin{cases} 1.0, & \text{for } \Delta E < 0, \\ e^{-\Delta E / T}, & \text{for } \Delta E \ge 0. \end{cases} \tag{3.4} $$

where $\Delta E$ = E(new state) - E(current state) is the energy gap between the new state
and the current state. The Boltzmann machine is known to better approximate the
physical metaphor, but is more computationally expensive (Davis and Ritter, 1987). The
Metropolis algorithm is used in this study. Fourth, one must specify a cooling schedule
consisting of

[1] an initial value of the control parameter, i.e., the initial artificial temperature T;

[2] a decrement function for decreasing the value of the control parameter, i.e., the
cooling rate;

[3] a final value of the control parameter or a stop criterion;

[4] a finite number of moves at each downward control parameter value, i.e., the
amount of time spent at each temperature.
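
The four elements of the cooling schedule can be made concrete with a minimal annealing loop. This Python sketch is a generic illustration rather than the Corana et al. (1987) algorithm used in this study: it assumes a geometric decrement function (the temperature is multiplied by a constant cooling rate), a fixed step length, and hypothetical names and default values throughout.

```python
import math
import random


def anneal(cost, x0, step, t_initial=10.0, cooling_rate=0.85,
           t_final=1e-3, moves_per_temp=200, seed=0):
    """Minimize cost(x) with a basic simulated annealing loop (illustrative only)."""
    rng = random.Random(seed)
    x, e = list(x0), cost(x0)
    best_x, best_e = list(x), e
    temperature = t_initial                    # [1] initial control parameter
    while temperature > t_final:               # [3] final value / stop criterion
        for _ in range(moves_per_temp):        # [4] moves at each temperature
            i = rng.randrange(len(x))          # perturb one coordinate at a time
            candidate = list(x)
            candidate[i] += rng.uniform(-1.0, 1.0) * step
            e_new = cost(candidate)
            delta = e_new - e
            # Metropolis acceptance probability, equation (3.4).
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                x, e = candidate, e_new
                if e < best_e:
                    best_x, best_e = list(x), e
        temperature *= cooling_rate            # [2] decrement function
    return best_x, best_e
```

With these assumed defaults the loop visits 57 temperature levels. A slower cooling rate or more moves per temperature trades computation time for a better chance of settling near the global minimum.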

Such  an  analogy  was  first  suggested  by  Kirkpatrick  et  al.  (1983).  They  linked  the 
algorithm  with  combinatorial  optimization,  specifically  to  the  problems  of  wire  routing 
and component placement in VLSI design. The rapid increase in inexpensive computing
power has led to several applications of the simulated annealing algorithm, including
computer and circuit design (Vecchi and Kirkpatrick, 1983), image restoration and
segmentation (Geman and Geman, 1984; Carnevali et al., 1985), the travelling salesman
problem (Bonomi and Lutton, 1984), artificial intelligence (Hinton and Sejnowski, 1983),
digital filter design (Benvenuto et al., 1992; Pitas, 1994), and vector quantization (Rose
et al., 1992). Because of the success of simulated annealing in combinatorial
optimization problems, its potential has been investigated for solving continuous function
minimization problems. Vanderbilt and Louie (1984) first modified simulated annealing
by using a covariance matrix for controlling the transition probability. Bohachevsky et al.
(1986) presented a simple, easy-to-implement method in which the length of a
generation  step  is  constant.  However,  the  Corana  et  al.  (1987)  implementation  of 
simulated  annealing  for  continuous  variable  problems  appears  to  offer  the  best 
combination  of  ease  of  use  and  robustness,  and  has  been  used  in  econometric  problems 
(Goffe  et  al.,  1992,  1994). 

3.2.3  The  Simulated  Annealing  Algorithm 

The Corana et al. (1987) algorithm is shown schematically in Figure 3-1. While a
detailed description of the algorithm can be found there, we briefly describe it as follows.
Let x be an M-dimensional vector with components $[x_1, x_2, \ldots, x_M]$. Let $\varepsilon(x)$ be the cost
function, and let $lb_j < x_j < ub_j$, $j = 1, \ldots, M$, be the M variables with corresponding
boundaries. The algorithm proceeds iteratively as follows: First, a cost-function
evaluation is made at the initial point x and its value $\varepsilon$ is recorded. Next, a new candidate
point, $x_n$, is generated by varying element i of x, namely:

$$x_{n,i} = x_i + r \cdot v_i \qquad (3.5)$$

Here $x_{n,i}$ is element i of $x_n$, the variable r is a uniformly distributed random number
from [-1, 1], and $v_i$ is element i of v, the step length vector of x. The new function value
$\varepsilon_n$ is then computed. If $\varepsilon_n$ is less than $\varepsilon$, $x_n$ is accepted, x is set to $x_n$, $\varepsilon$ is set to
$\varepsilon_n$, and the search path moves downhill. If this is the smallest $\varepsilon$ found so far, it and x
are recorded as the best current value. If $\varepsilon_n$ is greater than or equal to $\varepsilon$, the Metropolis
criterion (Metropolis et al., 1953) determines acceptance. Compute the value p as follows:

$$p = e^{(\varepsilon - \varepsilon_n)/T} \qquad (3.6)$$

In this equation, T represents the current temperature. Generate a uniformly distributed
random number $p_u$ from [0, 1]. The action is decided by comparing p with $p_u$: if $p_u$ is
less than p, the new point is accepted, x is updated with $x_n$, and
the search path moves uphill. Otherwise, $x_n$ is rejected, i.e., no move is made. Thus, the
process repeats from the new point (candidate point is accepted) or from the current
position (candidate point is rejected). From equation (3.6), we can see that the probability
of an uphill move decreases when the temperature is lower and the difference in the
function’s value is larger.

Figure 3-1: The simulated annealing algorithm (after Corana et al., 1987).
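
The candidate generation of equation (3.5) and the Metropolis test of equation (3.6) can be sketched as follows. This is a minimal illustration, not the ARTM implementation: the function names are hypothetical, and the handling of a candidate that falls outside the bounds (redrawing it uniformly inside them) is an assumption, since the text does not specify it.

```python
import math
import random

def propose(x, v, i, lb, ub, rng):
    """Candidate point per eq. (3.5): perturb only element i of x."""
    xn = list(x)
    xn[i] = x[i] + rng.uniform(-1.0, 1.0) * v[i]
    # Assumption: an out-of-bounds candidate is redrawn uniformly inside the bounds.
    if not lb[i] <= xn[i] <= ub[i]:
        xn[i] = rng.uniform(lb[i], ub[i])
    return xn

def accept(e, en, T, rng):
    """Acceptance decision: downhill always, uphill per eq. (3.6)."""
    if en < e:
        return True                    # downhill move
    p = math.exp((e - en) / T)         # eq. (3.6), p lies in (0, 1]
    return rng.random() < p            # uphill move accepted with probability p
```

Because p shrinks as T decreases or as the cost increase grows, the same uphill candidate is accepted far less often at a low temperature than at a high one.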

After $N_s$ steps through all elements of x, the step length vector v is adjusted so
that 50% of all moves are accepted. The goal is to make the algorithm follow the cost
function (Corana et al., 1987). A greater percentage of accepted points means that the
candidate points are too close to the current point, so the step length vector v is
enlarged. For a given temperature, this step adjustment increases the number of rejections
and decreases the percentage of acceptances. Conversely, a higher percentage of
rejected points means that the candidate points are too far from the current point, and a
reduced step length decreases the rejection rate.
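
The text does not give the update rule for the step length. The sketch below uses the rule usually quoted for the Corana et al. (1987) algorithm, in which the step grows when more than 60% of the moves were accepted and shrinks when fewer than 40% were accepted. The thresholds, the constant c = 2, and the function name are assumptions made for illustration.

```python
def adjust_step(v_i, n_accepted, n_s, c=2.0):
    """Return the adjusted step length for one element of x.

    v_i        -- current step length of element i
    n_accepted -- accepted moves for element i in the last n_s steps
    c          -- assumed adjustment constant (2.0 is a commonly quoted value)
    """
    ratio = n_accepted / n_s
    if ratio > 0.6:                    # too many acceptances: enlarge the step
        return v_i * (1.0 + c * (ratio - 0.6) / 0.4)
    if ratio < 0.4:                    # too many rejections: reduce the step
        return v_i / (1.0 + c * (0.4 - ratio) / 0.4)
    return v_i                         # acceptance rate near 50%: keep the step
```

With c = 2 the step length can at most triple or shrink to one third in a single adjustment, which keeps the acceptance rate near one half without abrupt changes.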

After $N_T$ times through the above loops (corresponding to thermal equilibrium),
the temperature, T, is reduced. The temperature is updated according to the following
equation:

$$T_n = r_T \cdot T \qquad (3.7)$$

The reduction coefficient $r_T$ has a value between 0 and 1. The starting point at the new
temperature ($T_n$) is the optimum point obtained at the last temperature (T). This makes
the search path start at the most favorable point. Since the temperature characterizes the
degree of “excitation” of the system, a lower temperature decreases the number of uphill
moves, so the number of rejections increases and the step size declines. The lower
temperature (smaller step size) makes the search space shrink and focus on the most
promising area, i.e., it concentrates most of the search in a smaller subset of low-energy
points.

The terminating criterion checks whether there have been no significant moves for the
last $N_\varepsilon$ temperatures. Assume that the optimum value obtained at temperature $T_k$ is $\varepsilon_k$.
Let $\varepsilon_{opt}$ be the current optimum value at the temperature $T_{k+1}$. If

$$|\varepsilon_k - \varepsilon_{k-m}| \le \eta, \quad \text{where } m = 1, \ldots, N_\varepsilon$$

$$|\varepsilon_k - \varepsilon_{opt}| < \eta \qquad (3.8)$$

then stop the search. Note that $\eta$ is a specified small constant. This check helps ensure that
the global or near-global minimum is reached. Another stop criterion is that the total
number of cost-function evaluations exceeds a specified constant $N_{tot}$.
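
The stopping test of equation (3.8) can be written as a small predicate. In this sketch the list `history`, the function name, and the argument names are hypothetical. The default values are those of Table 3-2.

```python
def should_stop(history, e_opt, eta=0.005, n_eps=4):
    """Return True when the search has stalled in the sense of eq. (3.8).

    history -- cost values recorded at successive temperatures, most recent last
    e_opt   -- best cost value found so far
    """
    if len(history) < n_eps + 1:
        return False                   # not enough temperatures visited yet
    e_k = history[-1]
    stalled = all(abs(e_k - history[-1 - m]) <= eta
                  for m in range(1, n_eps + 1))
    return stalled and abs(e_k - e_opt) < eta
```

The second condition prevents the search from stopping at a point that is stable but clearly worse than the best value already seen.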

In  summary,  the  simulated  annealing  algorithm  starts  at  some  high  temperature 
specified  by  the  user.  A sequence  of  points  is  then  generated  until  an  equilibrium  is 
approached.  During  this  random  walk  process  the  step  length  vector  is  periodically 
adjusted  to  better  follow  the  cost  function  behavior.  After  thermal  equilibrium,  the 
temperature  is  reduced  and  a new  sequence  of  moves  is  made  starting  from  the  current 
optimum  point,  until  thermal  equilibrium  is  reached  again,  and  so  forth.  The  process  is 
terminated  at  a low  temperature  such  that  no  more  useful  moves  can  be  made,  according 
to  the  stopping  criterion. 
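
The complete loop summarized above can be sketched on a toy cost function. This is an illustrative sketch under stated assumptions rather than the ARTM code: the function name `anneal`, the bound handling, the step-adjustment rule (the one usually quoted for Corana et al., 1987), and the choice to test the evaluation budget only once per temperature are all assumptions. The default arguments follow Table 3-2, with T taken at the upper end of its range.

```python
import math
import random

def anneal(cost, x0, lb, ub, T=0.2, r_T=0.85, N_s=20, N_T=5,
           N_eps=4, eta=0.005, N_tot=5001, v0=3.0, seed=0):
    """Minimize cost(x) within [lb, ub]; returns (x_opt, e_opt, n_eval)."""
    rng = random.Random(seed)
    M = len(x0)
    x, e = list(x0), cost(x0)
    x_opt, e_opt = list(x), e
    v = [v0] * M
    n_eval = 1
    history = []                      # cost at the end of each temperature
    while n_eval < N_tot:             # budget checked once per temperature
        for _ in range(N_T):          # N_T step adjustments per temperature
            accepted = [0] * M
            for _ in range(N_s):      # N_s passes through all elements of x
                for i in range(M):
                    xn = list(x)
                    xn[i] = x[i] + rng.uniform(-1.0, 1.0) * v[i]   # eq. (3.5)
                    if not lb[i] <= xn[i] <= ub[i]:
                        xn[i] = rng.uniform(lb[i], ub[i])  # assumed bound handling
                    en = cost(xn)
                    n_eval += 1
                    # Downhill moves are always accepted; uphill per eq. (3.6).
                    if en < e or rng.random() < math.exp((e - en) / T):
                        x, e = xn, en
                        accepted[i] += 1
                        if e < e_opt:
                            x_opt, e_opt = list(x), e
            for i in range(M):        # assumed Corana-style step adjustment
                ratio = accepted[i] / N_s
                if ratio > 0.6:
                    v[i] *= 1.0 + 2.0 * (ratio - 0.6) / 0.4
                elif ratio < 0.4:
                    v[i] /= 1.0 + 2.0 * (0.4 - ratio) / 0.4
                v[i] = min(v[i], ub[i] - lb[i])
        history.append(e)
        recent = history[-(N_eps + 1):-1]
        if (len(recent) == N_eps
                and all(abs(e - h) <= eta for h in recent)
                and abs(e - e_opt) < eta):
            break                     # eq. (3.8): no significant change
        T *= r_T                      # eq. (3.7)
        x, e = list(x_opt), e_opt     # restart from the best point so far
    return x_opt, e_opt, n_eval
```

For the inverse filtering problem, `cost` would be the error distance between the model-derived and target formants, and `x0`, `lb`, and `ub` would hold the initial articulatory vector and its bounds.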

3.3  Speech  Inverse  Filtering  Strategy  and  Procedure 

In  general,  the  relationship  between  the  shape  of  the  vocal  tract  and  its  acoustic 
output  can  be  represented  by  a multidimensional  function  of  a multidimensional  argument 

y = f(x)  (3.9) 

where  x is  a vector  formed  by  the  coordinates  of  the  articulators,  y is  a vector  formed  by 
the  corresponding  acoustic  features,  and  f is  the  function  relating  these  vectors.  Given  an 
acoustic measurement $y_d$, the problem is to find an articulatory state $x_0$ such that $f(x_0)$ is
the best match to $y_d$. In other words, with the optimization approach, $x_0$ is the solution to
the nonlinear optimization problem:

$$x_0 = \arg\min_x \| f(x) - y_d \| \qquad (3.10)$$

where $\|\cdot\|$ is a norm on the acoustic space.

3.3.1  Strategy 

Speech  inverse  filtering  is  a “constrained  multidimensional  nonlinear  optimization 
problem.”  As  we  defined  in  Chapter  2,  section  2.2.1,  the  coordinates  of  the  tongue  body 
(tbodyx,  tbodyy),  tongue  tip  (tipx,  tipy),  lips  (lipp,  lipo),  jaw  (jaw),  and  hyoid  (hyoid) 
compose  the  multidimensional  articulatory  vector  x,  i.e., 

x = [tbodyx,  tbodyy,  tipx,  tipy,  lipp,  lipo,  jaw,  hyoid]  (3.11) 

Note  that  x is  an  8-dimensional  vector.  Usually,  the  velum  is  set  at  different  default 
positions  for  nasal,  non-nasal,  or  nasalized  phonemes,  but  it  can  be  optimized  for  some 
phonemes.  The  dimensions  of  the  lower  pharynx  are  also  allowed  to  be  optimized 
whenever  this  is  necessary. 

We  designate  the  articulatory  vector  as 

$$x = [x_1, x_2, \ldots, x_M] \qquad (3.12)$$

where the value of M represents the number of dimensions of the articulatory domain to
be optimized. For the vector of equation (3.11), M has a value of eight. For nasal
and nasalized sounds, we may include the velum as an additional articulatory parameter,
i.e., M is set to 9. For middle vowels, some back vowels, and semivowels, three more
parameters are included: the anterior-posterior movements of K and H (glk and wh) and
the height between K and H (hkl) (refer to Figure 2-3), i.e., M is set to 11. In
the extreme case, one more parameter, the velum, is included, i.e., M = 12.
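
The four vector sizes described above (M = 8, 9, 11, or 12) follow from two independent choices. The sketch below makes this explicit. The parameter names come from the text, while the function name and its flags are hypothetical.

```python
BASE = ["tbodyx", "tbodyy", "tipx", "tipy", "lipp", "lipo", "jaw", "hyoid"]
LOWER_PHARYNX = ["glk", "wh", "hkl"]

def articulatory_parameters(optimize_velum=False, optimize_lower_pharynx=False):
    """Return the names of the articulatory parameters to be optimized."""
    names = list(BASE)                 # M = 8 by default
    if optimize_lower_pharynx:
        names += LOWER_PHARYNX         # adds three parameters
    if optimize_velum:
        names.append("velum")          # adds one parameter
    return names
```

The length of the returned list is the dimension M passed to the optimizer.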

The acoustic vector is composed of the first four formant frequencies, i.e.,
$y = [F_1, F_2, F_3, F_4]$. The cost function (error distance) is derived from a comparison
between the first four formant frequencies of the articulatory model and the first four
formant frequencies determined from speech analysis. A percentage of the weighted
least-absolute-value ($l_1$-norm) error distance is defined as:

$$\varepsilon(F_m(x)) = \sum_{i=1}^{4} w_i \, |F_{mi}(x) - F_{ti}| \qquad (3.13)$$

where $F_{mi}$ is the i-th model-derived formant, which is a function of the articulatory vector, $F_{ti}$ is
the i-th target-frame formant estimated from the analysis of the speech signal, and $w_i$ is the
assigned weight.
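
A weighted least-absolute-value error distance of this kind can be computed as sketched below. Equation (3.13) as extracted does not show how the distance is normalized to a percentage, so this sketch assumes that each deviation is divided by its target formant and that the weights sum to one. Both the normalization and the function name are assumptions.

```python
def error_distance(model_formants, target_formants, weights):
    """Weighted l1 error distance between two formant sets, in percent.

    Assumption: each absolute deviation is taken relative to the target
    formant; the source equation does not show the normalization.
    """
    total = 0.0
    for f_m, f_t, w in zip(model_formants, target_formants, weights):
        total += w * abs(f_m - f_t) / f_t
    return 100.0 * total
```

Under these assumptions, a model whose four formants each deviate by 1% from their targets has an error distance of 1%, which is the adequacy threshold used in this chapter.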


The  constraints,  which  include  the  articulatory-to-acoustic  transformation  function 
f (equation  (3.9))  and  the  boundary  conditions  of  the  articulatory  parameters,  are 
described  as  follows: 

$$y_m = f(x) = f([x_1, x_2, \ldots, x_M]) = F_m(x) = [F_{m1}(x), F_{m2}(x), F_{m3}(x), F_{m4}(x)] \qquad (3.14)$$

where $lb_j < x_j < ub_j$, $j = 1, \ldots, M$, are the lower and upper bounds of the articulatory
parameters, and the subscript m denotes a model-derived quantity.

The object of the optimization process is to find the optimal articulatory vector
that generates a (model-derived) acoustic vector as close to the desired (target-frame)
vector as possible. The ideal minimum value of $\varepsilon(F_m(x))$ is 0%, but some approximations
used in the articulatory model (see Chapter 2, section 2.2) make this value hard to reach.
The first approximation is related to the articulatory model itself: a non-robust
representation of the lower part of the pharynx and the tongue tip-to-jaw region may cause
some deviations in the midsagittal vocal tract outline. The second, and more significant,
source of deviation is the uncertainty of the sagittal distance to cross-sectional area
transformation; different empirical transformation formulas can be found in the literature
(Heinz and Stevens, 1964; Mermelstein, 1973; Sundberg et al., 1987). The third is the area
to formant frequency conversion. We have determined that an error criterion requiring the
final value of the error distance function to be less than 1% appears adequate.

3.3.2 Procedure

To  extract  the  articulatory  trajectories  from  a speech  sentence,  the  first  step  is  to 
obtain  a smoothed  formant  trajectory  from  the  speech  signal.  See  Chapter  4,  section  4.1, 
for  more  details  on  formant  trajectory  extraction.  Then  N target  frames  are  selected.  The 
target  frame  selection  is  based  on  the  results  of  the  speech  analysis,  which  include  the 
formant  trajectory,  the  location  of  the  word  endpoints,  and  the  estimated  phoneme 
boundaries  of  the  speech  signal.  Figure  3-2  shows  the  selected  target  frames  of  the 
formant  tracks  obtained  from  the  sentence  “We  were  away  a year  ago”  spoken  by  a male 
subject.  There  are  26  target  frames  selected  for  this  sentence.  Finally,  the  speech  inverse 
filtering  procedure  is  applied  to  each  target  frame  to  obtain  the  optimum  articulatory 
parameters. 

Figure  3-3  shows  the  block  diagram  of  the  speech  inverse  filtering  procedure. 
The  procedure  is  performed  frame-by-frame.  For  each  target  frame,  an  initial  value  of  the 
error  distance  function  (cost  function)  is  evaluated  from  the  initial  articulatory  vector. 
The  error  distance  function  evaluation  includes  the  computations  of  the  sagittal  distances 
and  the  section  lengths,  the  calculations  of  the  vocal  tract  cross-sectional  area  and  the 
acoustic  transfer  function,  the  decomposition  of  the  first  four  formants  from  the  acoustic 
transfer  function,  and  the  calculation  of  the  error  distance.  Then  the  simulated  annealing 
algorithm  controls  the  movement  of  the  search  path.  Each  movement  requires  the 
generation  of  a next  candidate  point,  the  error  distance  function  evaluation  for  the 
candidate  point,  and  the  decision  to  move.  After  a number  of  steps,  the  temperature  is 
lowered  and  a new  search  begins.  The  process  stops  if  the  near-global  minimum  is 
reached  or  the  maximum  allowed  number  of  function  evaluations  is  exceeded.  The 
speech  inverse  filtering  procedure  terminates  when  all  target  frames  are  optimized.  The 
articulatory parameters and the vocal tract cross-sectional areas of all N optimized
target frames can be saved as a disk file for later use or passed directly to the
articulatory synthesizer for synthesis.

Figure 3-2: The selected target frames from the formant tracks.

Note that each target frame is defined as a data structure in a C program. Table
3-1  lists  the  components  of  the  data  structure.  The  first  four  formant  frequencies,  the 
frame  starting  time,  and  the  frame  duration  are  the  initial  components.  After  the 
minimum  error  distance  is  obtained,  the  first  section  area  of  the  nasal  tract,  the  optimal 
articulatory  coordinates,  the  section  lengths,  and  the  cross-sectional  areas  are  stored  as  the 
content  of  the  target-frame  data  structure. 

The annealing parameters that control the simulated annealing algorithm include
the initial temperature T, the temperature reduction coefficient $r_T$, the number of steps
before each adjustment of the step length vector $N_s$, the number of step adjustments at
each temperature $N_T$, the number of successive temperature reductions to test for
termination $N_\varepsilon$, the small constant used in the terminating criterion $\eta$, and the maximum
number of function evaluations $N_{tot}$. The analogies between the annealing process and
the articulatory problem can be identified as follows. First, the percentage of the
weighted least-absolute-value ($l_1$-norm) error distance, equation 3.13, corresponds to
the energy of the material. The articulatory vector, equation 3.11 or 3.12, corresponds to
the configuration of particles. The change of articulatory parameters corresponds to the
rearrangement of particles. Finding a near-optimal articulatory vector corresponds to
finding a low-energy configuration. The temperature of the annealing process, T, becomes
the control parameter for the speech inverse filtering process. Second, the Metropolis
algorithm corresponds to the random fluctuations in energy. Third, the temperature
reduction coefficient $r_T$ corresponds to the cooling rate. Fourth, the finite number of
moves at each downward control temperature value, $N_s \cdot N_T$, corresponds to the amount
of time spent at each temperature.

Reasonable parameter values, found after some experimental tests (Table 3-2), are
used as defaults for the optimization process. A guideline for the optimization process
is given in Appendix D.

Table 3-1: Components of the target frame structure.

Data type          Component  Description
-----------------  ---------  -------------------------------
short              tfsetflag  flag for optimization
double             ttime      frame starting time
double             tdur       frame duration
float              ntla       velopharyngeal port area
structure pointer  areaf      area function
structure pointer  vtlen      section lengths
structure pointer  shape      articulatory coordinates
structure pointer  tfptr      target formants
structure pointer  next       used for the doubly linked list
structure pointer  previous   used for the doubly linked list
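
The layout of Table 3-1 can be mirrored in a short sketch. The original structure is a C structure whose declaration is not reproduced in this chapter, so the version below is only an analogue: the field names follow the table, plain lists stand in for the pointed-to structures, and the helper `link_frames` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TargetFrame:
    ttime: float                                   # frame starting time
    tdur: float                                    # frame duration
    tfptr: List[float]                             # target formants F1..F4
    tfsetflag: int = 0                             # flag for optimization
    ntla: float = 0.0                              # velopharyngeal port area
    areaf: List[float] = field(default_factory=list)   # area function
    vtlen: List[float] = field(default_factory=list)   # section lengths
    shape: List[float] = field(default_factory=list)   # articulatory coordinates
    # Links of the doubly linked list; excluded from repr and comparison
    # so that printing or comparing a frame does not walk the whole list.
    next: Optional["TargetFrame"] = field(default=None, repr=False, compare=False)
    previous: Optional["TargetFrame"] = field(default=None, repr=False, compare=False)

def link_frames(frames):
    """Chain the frames in order into a doubly linked list."""
    for a, b in zip(frames, frames[1:]):
        a.next, b.previous = b, a
    return frames
```

The fields without defaults correspond to the initial components named in the text (formants, starting time, duration). The remaining fields are filled in once the optimization of a frame has finished.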

Table 3-2: Default annealing parameter values.

Symbol           Description                                             Default value
---------------  ------------------------------------------------------  ---------------
$T$              artificial temperature (as control parameter)           0.1-0.2 degrees
$r_T$            temperature reduction coefficient                       0.85
$N_s$            number of steps to adjust the step length vector        20
$N_T$            number of adjustments at each temperature               5
$N_\varepsilon$  number of successive temperatures to test for stopping  4
$\eta$           termination criterion                                   0.005
$N_{tot}$        total number of function evaluations                    5,001
$v_i$            step length, where i = 1, 2, ..., M                     3.0
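
The defaults of Table 3-2 can be collected in one place and used for simple schedule arithmetic, as sketched below. The class and function names are hypothetical, and T is set to the upper end of the tabulated range as an arbitrary choice. The count of evaluations per temperature assumes that each of the $N_s$ steps visits all M elements of x once, as described in Section 3.2.3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnealingDefaults:
    T: float = 0.2        # initial temperature (table gives 0.1-0.2)
    r_T: float = 0.85     # temperature reduction coefficient
    N_s: int = 20         # steps before each step-length adjustment
    N_T: int = 5          # step-length adjustments per temperature
    N_eps: int = 4        # successive temperatures tested for stopping
    eta: float = 0.005    # termination criterion
    N_tot: int = 5001     # maximum number of function evaluations
    v: float = 3.0        # initial step length for every element

def temperature_after(p, k):
    """Temperature after k reductions, from repeated use of eq. (3.7)."""
    return p.T * p.r_T ** k

def cycles_per_temperature(p):
    """Number of passes through all elements of x at one temperature."""
    return p.N_s * p.N_T

def evaluations_per_temperature(p, M):
    """Cost-function evaluations at one temperature for an M-dimensional x."""
    return cycles_per_temperature(p) * M
```

With these defaults and M = 8, one temperature costs 800 evaluations, so the evaluation budget of 5,001 allows about six complete temperatures.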



3.4  Results  and  Remarks 


3.4.1  Optimization  of  Vowels 

Appendix  A presents  the  articulatory  and  acoustic  characteristics  for  typical 
American  vowels.  The  midsagittal  vocal  tract  outline,  and  the  corresponding  vocal  tract 
cross-sectional  area  function  are  obtained  from  sustained  vowel  phonations  by  using  the 
simulated  annealing  algorithm.  From  Appendix  A,  we  can  see  that  the  simulated 
annealing  optimization  algorithm  works  well,  since  most  of  the  error  distances  are  less 
than  0.5%.  From  these  results,  we  arrive  at  the  following  observations: 

(1) Different vowels are characterized by different sets of resonant frequencies
    (formants) and thus by different vocal tract shapes. For example, front vowels
    (/i, ɪ, æ, ɛ/) are characterized by a large difference between F2 and F1, which
    needs a large back cavity, and middle vowels (/ɝ, ʌ, ə/) and back vowels (/u, ʊ, a/)
    have a small difference between F2 and F1, which indicates a narrow back
    cavity.

(2) The three middle vowels (/ɝ, ʌ, ə/) and the low, back vowel (/a/) have similar
    vocal tract shapes, except that the retroflexed vowel /ɝ/ has a more significant
    tongue curl-up, which results in a distinctly low F3.

(3) For all middle vowels (/ɝ, ʌ, ə/) and some other vowels (/as, a, ou/), the three
    articulatory parameters of the lower pharynx, i.e., wh, glk, and hkl, must be
    optimized.

(4) To investigate the complexity of the error distance function for each vowel, we
    use the same initial vocal tract configuration for the optimization process for
    all vowels. Experience with the optimization process indicates that the error
    distance function for the high front vowels (/i, ɪ/) and the high, back, rounded
    vowels (/u, ʊ/) converges faster than for other vowels. The middle vowel (/ɝ/)
    and the low, back vowel (/a/) have the slowest convergence rate because these
    two vowels have a more complex error distance function. For the middle
    vowel (/ɝ/), the curl of the tongue causes a distinctly low F3, which may result
    in a slow convergence of the optimization process. To articulate the low, back
    vowel (/a/), the tongue needs to be lowered and the jaw needs to be opened in
    order to widen the oral cavity. Also, the tongue must move back to narrow the
    pharyngeal cavity. These articulations may make the error distance function
    complex and slow its convergence. Notice that a more complex error
    distance function needs a higher initial temperature and more evaluations,
    since it might have more local minima to escape. See Appendix D for a
    guideline for the optimization process.

3.4.2  Optimization  for  a Sentence 

In  Appendix  D,  the  simulated  annealing  algorithm  is  applied  to  perform  the  speech 
inverse  filtering  for  two  speech  signals  that  were  obtained  from  one  speech  token  spoken 
by  two  male  subjects.  The  following  are  some  general  observations  regarding  sentence 
optimization: 

(1) There are two semivowels (/w, j/) in the speech token analyzed. According to
    the phonetic classifications, these two semivowels have been categorized as
    glides. As we can see from the formant tracks in Figure 3-2, the formants,
    especially the second and third formants, glide up or down to the next vowel.
    All the /w/ glides (frames 1, 6, and 10 of subject A, and frames 1,
    4, and 8 of subject B) have three places of articulation in common: the
    protruded lips, high tongue tip, and high back tongue. However, some of these
    have a significant upward tongue tip curl. As frame 16 of subject A and frame
    13 of subject B show, the tongue blade of the semivowel /j/ approximates the
    palate, which is why /j/ has been called a palatal glide. In summary, the vocal
    tract shapes of glides “glide” to the next vowel with a fast movement of the
    tongue and lips.

(2) Both subjects have quite similar vocal tract shapes for vowel /i/ and for vowel
    /r/, respectively. The lip opening decreases during the transition from vowel
    /i/ to semivowel /w/ (frame sequences 3-4-5-6 of subject A, and 2-3-4 of
    subject B) in order to have protruded lips for /w/.

(3) The diphthong /ei/, ending with the vocal tract shaped for /i/, entails forward
    and upward tongue movement from the /e/ (frame sequences 11-12-13-14 of subject
    A, and 9-10-11 of subject B). The diphthong /ou/, ending with the vocal tract
    shaped for /u/, entails tongue movement back and up, concurrent with lip
    protrusion (frame sequences 24-25-26 of subject A, and 20-21-22 of subject
    B).

(4) The voiced stop /g/ is usually classified as a velar consonant. Frame 22
    of subject A and frame 18 of subject B suggest that classifying it as a
    palatal-velar consonant is more accurate (Borden and Harris, 1980, p. 117).

(5) The simulated annealing algorithm performs well. On average, over 87%
    of the total frames have an error distance less than 0.1%.

3.4.3  Remarks 


The  above  results  illustrate  the  usefulness  of  the  simulated  annealing  algorithm, 
which  has  proved  to  be  efficient  and  very  flexible  in  dealing  with  the  problems  that  are 
inherent  to  the  acoustic-to-articulatory  transformation.  However,  the  selection  of 
parameters  for  the  annealing  schedule  is  an  obstacle  for  the  simulated  annealing 
algorithm,  since  we  know  little  about  the  relation  between  the  argument  domain 
(articulatory  vector)  and  the  technology  (the  algorithm).  The  guideline  in  Appendix  D 
and  the  default  annealing  parameter  values  in  Table  3-2  are  considered  a good  procedure 
at this time. The evaluation of the error distance function is the most computationally
intensive part of the program. On average, a Sun SPARCstation 10 performs about 2000
such evaluations per minute.




3.5  Summary 

We  have  described  the  simulated  annealing  optimization  algorithm  in  detail  after 
reviewing  the  derivations  of  the  vocal  tract  cross-sectional  area.  The  simulated  annealing 
algorithm  is  based  on  the  Corana  et  al.  (1987)  approach.  The  articulatory  vector  defines 
the  set  of  parameters  to  be  optimized.  The  cost  function  is  a percentage  of  the  weighted 
least-absolute-value  error  distance.  It  defines  a comparison  of  the  first  four  formant 
frequencies  between  the  model-generated  and  the  target-frame  (from  speech  analysis).  A 
1%  error  criterion  was  found  to  give  satisfactory  results.  Once  the  optimum  articulatory 
vector  is  obtained,  the  articulatory  model  determines  the  vocal  tract  cross-sectional  area 
function,  which  in  turn  is  used  by  the  articulatory  speech  synthesizer.  Results  and 
discussion  of  speech  inverse  filtering  for  twelve  typical  American  vowels  and  two 
sentences  were  presented  in  Appendix  A and  Appendix  D,  respectively.  Default 
annealing  parameters  that  control  the  simulated  annealing  algorithm  were  also  given.  The 
simulated  annealing  algorithm  has  proven  to  be  efficient  and  very  flexible  in  dealing  with 
the  problems  that  are  inherent  to  speech  inverse  filtering. 


CHAPTER  4 

SOFTWARE  SYSTEM  FOR  ARTICULATORY  SYNTHESIS 

The  articulatory  synthesis  software  embodies  several  characteristics,  which  are 
described  here.  The  synthesis  requires  a set  of  features  that  are  measured  from  the 
original  speech  signal.  The  major  features  are  the  formant  tracks  and  the  pitch  contour. 
In addition, a model of the vocal tract is obtained via inverse filtering, i.e., performing an
acoustic-to-articulatory  transformation.  An  excitation  waveform  model  must  also  be 
constructed.  Finally,  all  of  these  characteristics  form  the  complete  articulatory  synthesis 
model. 

Figure  4-1  is  the  block  diagram  of  the  articulatory  synthesizer  software  system. 
The  software  program  is  called  ARTM  (ARTiculatory  Model).  All  modules,  except  the 
analysis  phase,  are  implemented  with  devguide,  XView,  and  C functions.  This  chapter 
describes  the  ARTM  software  program. 

4.1  Analysis  Phase 

We use the ESPS program formant to extract the first four formant frequencies
from the speech signal. The block diagram of Figure 4-2 illustrates how the program
formant estimates formant frequencies. Preemphasis is applied in order to compensate
partially for the voice source and the radiation characteristics. A rectangular window with
a window length (frame length) of 250 speech samples (for a 10 kHz sampling frequency)
is used. The frame rate is 100 frames per second. The covariance method of 12th-order
LPC is applied to each frame of sampled speech. The formant frequency candidates are
obtained by solving for the roots of the linear prediction polynomial. Dynamic
programming with frequency continuity constraints is used to optimize the formant
trajectory estimates by using a modified Viterbi algorithm. Note that the formant
frequencies estimated by the program formant are saved in ESPS data format. Another
ESPS program, pplain, is used to convert the data from ESPS format to ASCII format. We
use a two-pass method (Childers and Lee, 1991) to estimate the pitch contour and the LF
model parameters. The two-pass method is implemented in MATLAB.

Figure 4-1: The block diagram of the articulatory synthesis software system.

Figure 4-2: Block diagram of the processing procedure of ESPS program formant.
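
Two pieces of arithmetic in the analysis phase can be made concrete: the frame layout implied by a 250-sample window at 100 frames per second, and the conversion of a root of the linear prediction polynomial into a formant candidate. The sketch below is not the ESPS implementation. The function names are hypothetical, and the root-to-frequency and root-to-bandwidth formulas are the standard textbook relations for a discrete-time pole.

```python
import cmath
import math

def frame_starts(n_samples, fs=10000, frame_len=250, frame_rate=100):
    """Start indices of analysis frames; consecutive frames overlap."""
    hop = fs // frame_rate             # 100 samples (10 ms) between frame starts
    return list(range(0, n_samples - frame_len + 1, hop))

def root_to_formant(z, fs=10000):
    """Frequency and 3-dB bandwidth (both in Hz) of a complex root z, |z| < 1."""
    freq = cmath.phase(z) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(z)) * fs / math.pi
    return freq, bandwidth
```

At 10 kHz a 250-sample frame lasts 25 ms while frames start every 10 ms, so each frame shares 150 samples with its successor. Only roots with positive angle need to be converted, since the complex-conjugate root describes the same resonance.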

4.2  Speech  Inverse  Filtering  Phase 

The  purpose  of  speech  inverse  filtering  is  to  obtain  the  vocal  tract  cross-sectional 
area  for  the  articulatory  synthesizer.  This  topic  was  discussed  in  Chapter  3.  As  a brief 
review,  we  first  select  one  or  more  target  frames  based  on  the  results  of  the  speech 
analysis,  which  include  the  location  of  the  word  endpoints  and  the  estimated  phoneme 
boundaries  of  the  speech  signal.  Then  we  apply  the  simulated  annealing  optimization 
algorithm  to  each  target  frame  to  calculate  the  optimum  articulatory  vector  and  the 
corresponding  vocal  tract  cross-sectional  area  function.  The  speech  inverse  filtering 
phase  in  Figure  4-1  illustrates  this  scheme.  This  software  also  has  stored  area  functions 
of  five  Russian  vowels  (Fant,  1960).  These  data  can  be  used  for  synthesis  if  desired. 

Figure  4-3  shows  the  user  interactive  windows  used  during  the  speech  inverse 
filtering  phase.  Upon  starting  the  ARTM  program,  the  Articulatory  Speech  Synthesizer 
(Main  Window)  window  appears  as  the  main  control  window.  The  canvas  is  used  to  draw 
the  midsagittal  vocal  tract  outline  and  the  corresponding  cross-sectional  area  function,  and 
to  display  the  first  four  formants  of  the  target-frame  and  of  the  model-generated,  the  error 
distance,  and  the  articulatory  coordinates.  The  control  buttons  are  used  to  load,  unload, 
and  save  various  data  files,  to  exit  from  the  ARTM  program,  and  to  call  up  other  windows 
such  as  Excitation  Source  Window,  Articulatory  Synthesis  Window,  and  several  popup 
windows. 

The  Formant  Tracks  Display  Window  presents  the  formant  tracks,  which  are 
loaded  from  a formant  data  file  that  was  obtained  during  the  analysis  phase.  The  user  can 
set any target frame by manual placement. Press the left mouse button to mark one
formant  data  set  as  a target  frame.  Press  the  middle  mouse  button  to  unmark  a target 
frame.  The  right  mouse  button  brings  up  a menu  with  options  for  loading  and  saving  the 
target-frame  marks,  unmarking  all  target-frame  marks,  and  saving  the  target  frames  into  a 
file  (target  file). 

The  Articulatory  Position  Settings  popup  window  was  implemented  to  configure 
the  initial  articulatory  configuration  for  each  target  frame  for  the  optimization  process. 
By  adjusting  the  settings  of  the  nine  articulatory  sliders,  which  control  the  positions  of 
articulators,  various  vocal  tract  configurations  can  be  constructed.  The  corresponding 
cross-sectional  area  function,  the  model-derived  formants,  and  the  error  distance  (cost 
function  value)  are  updated  simultaneously  on  the  canvas  of  the  main  window.  This 
feature  facilitates  the  construction  of  the  initial  vocal  tract  configuration  that  leads  to  a 
rapid  convergence  of  the  optimization  process.  In  addition,  this  popup  window  has 
several  control  buttons  that  allow  the  user  to  access  any  target  frame,  to  draw  sagittal  grids 
on  the  vocal  tract  outline,  and  to  clear  the  canvas  of  the  main  window.  Other  posted 
messages  include  the  number  of  total  target  frames,  the  frame  number  of  the  current 
target,  and  the  frame  time  information. 

The  other  two  popup  windows  of  Figure  4-3  are  related  to  the  optimization 
process. The Articulatory Optimization Setup Window specifies the dimension of the
articulatory vector being used in the optimization process, sets the desired nasalization of
vowels, and allows a return to the original midsagittal vocal tract configuration if an
unsatisfactory optimization is obtained. The Simulated Annealing Window controls the
initial simulated annealing parameter settings.

In  Figure  4-3,  one  optimized  target  frame,  which  is  the  phoneme  /i/  in  the 
sentence  “We  were  away  a year  ago”  spoken  by  a male  subject,  is  displayed.  The  final 
error  distance  for  this  frame  is  0.01534%. 

4.3 Excitation Generation Phase

The  excitation  generation  phase  sets  up  the  excitation  frames  for  the  articulatory 
synthesizer.  Figure  4-1  illustrates  the  scheme  for  specifying  the  excitation  frames.  For  a 
voiced  glottal  excitation  source,  the  specifications  include  the  LF  model  parameters,  the 
jitter  and  shimmer  models,  and  the  aspiration  noise  model.  For  an  unvoiced  excitation 
source,  the  place  for  the  turbulence  flow  generation  and  other  related  characteristics  are 
specified.  For  source-tract  interaction,  the  Foster-chain  circuit  model  of  the  subglottal 
system  and  the  model  of  the  glottal  area  function  are  specified. 

Figure  4-4  shows  the  excitation  related  setup,  control,  and  display  windows.  The 
Excitation  Source  Window  is  the  major  control  and  display  window.  It  is  activated  by 
pressing the Exci button in the main window (see Figure 4-3). The excitation place,
mode,  and  time  message  are  specified  for  each  excitation  frame.  The  number  of  total 
excitation  frames  and  the  frame  number  of  the  current  excitation  are  posted  at  the  bottom 
right  of  this  window.  One  check  box  labeled  with  Included  indicates  the  models  used  for 
the  excitation,  which  include  the  jitter  and  shimmer  models,  the  aspiration  noise  model, 
and  the  subglottal  system  model.  Several  control  buttons  provide  the  means  to  load  and 
save  the  excitation  specification  files,  a method  for  accessing  and  editing,  and  a method 
for  displaying  waveforms.  Four  waveforms  are  displayed  on  the  canvas:  the  differential 
glottal  waveform  (LF  model),  the  glottal  pulse  (the  integral  of  the  LF  model),  the  power 
spectral  density  of  the  glottal  pulse,  and  the  aspiration  noise  waveform.  If  the  subglottal 
system  is  included  then  the  glottal  area  pulse  is  displayed  instead  of  the  aspiration  noise 
waveform. 

Several  popup  windows  allow  the  user  to  specify  the  parameters  for  the  excitation 
source,  source-tract  interaction,  and  subglottal  models.  The  Voiced  Parameters  popup 
window  includes  the  LF  model  parameters  and  the  gain  settings.  The  jitter  and  shimmer 
generation  can  be  controlled  by  the  Jitter  and  Shimmer  parameter  settings  popup  window. 
The Aspiration noise parameters popup window provides for the specification of the
parameters  for  aspiration  noise  generation.  The  effective  glottal  area  time  function  and 
the  cascaded  RLC  resonance  modules  of  the  subglottal  system  are  specified  by  the 
Subglottal  mode  parameters  popup  window.  Finally,  the  Turbulence  noise  parameters 
popup  window  controls  the  characteristics  of  the  turbulence  flow  generation. 

Figure 4-4 shows one excitation frame example. This excitation frame specifies
the  parameters  for  generating  a mixed  excitation  source  with  the  subglottal  system 
coupling.  A raised  cosine  glottal  area  model  is  selected.  Other  options  for  the  glottal  area 
model  are  triangle  and  sinusoidal.  The  major  window  canvas  displays  the  differential 
glottal  waveform,  the  glottal  pulse,  the  glottal  pulse  power  spectrum,  and  the  glottal  area 
pulse.  For  synthesizing  a sentence,  one  or  more  excitation  frames  are  specified  according 
to  the  information  and  data  obtained  from  the  analysis  phase. 

4.4  Synthesis  Phase 

Once  the  excitation  frames  and  the  optimized  target  frames  are  established,  the 
Articulatory  Synthesis  Window  (see  Figure  4-5)  is  called  up  by  pressing  the  Syn  button 
on  the  main  window.  This  display  window  is  divided  into  twelve  subareas.  During  the 
synthesis  of  speech  these  subareas  are  used  to  display  the  following  messages  and 
waveforms: 

[1] the target- and excitation-frame messages,

[2] the vocal tract cross-sectional area function, the acoustic transfer function, and
the midsagittal vocal tract outline of the current target frame,

[3] the excitation waveform and its power spectrum,

[4] the pressure and the volume velocity waveforms at different places in the vocal
tract,

[5] the articulatory trajectories, and

[6] the synthetic speech waveform.

One control button labeled with View is used to activate the Target Frames View
Setup  popup  window  (not  shown  in  Figure  4-5).  This  popup  window  enables  the  user  to 
investigate  the  acoustic  transfer  function  for  different  vocal  system  structures  and  acoustic 
characteristics.  For  example,  setting  the  nasal  coupling  area  and  the  glottal  conditions 
constructs  the  different  vocal  system  structures,  and  setting  the  number  of  vocal  tract 
sections  and  the  location  of  the  excitation  specifies  the  various  acoustic  characteristics. 
Note  that  the  glottal  conditions  specify  the  glottis  as  open  or  closed.  If  the  glottis  is  open, 
the  subglottal  system  and  the  glottal  impedance  are  coupled  to  the  vocal  tract.  Several 
acoustic transfer functions can be displayed overlapping each other by setting the
exclusive setting glyph, labeled with Display overlap, to the On position. This feature enables
the investigation and comparison of the acoustic transfer function for various vocal
system structures and acoustic characteristics. Several control buttons on this popup
window allow the user to access any target frame and to clear the canvas of the
synthesis window.

Either  the  vocal  tract  cross-sectional  areas  or  the  articulatory  parameters  can  be 
interpolated  between  the  current  target  frame  and  the  next  target  frame  during  the 
synthesis  of  speech.  Two  interpolation  functions  are  used:  linear  and  arctan.  If  the  arctan 
function  is  selected,  the  Arctan  interpolation  parameters  popup  window  (not  shown  in 
Figure  4-5)  is  called  up  for  setting  the  start  transition  time,  the  transition  point,  and  the 
transition  rate.  The  stack  setting  glyph,  labeled  with  Interpolation,  specifies  the 
interpolation  method.  The  exclusive  setting  glyph,  labeled  with  Syn.  Samp.  Freq.,  allows 
the  user  to  select  the  synthesis  sampling  rate.  Another  exclusive  setting  glyph,  labeled 
with  VT  section  No.,  controls  the  number  of  vocal  tract  sections  being  used  during  the 
synthesis  of  speech. 
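
As a rough illustration of the two interpolation options, the sketch below blends a parameter vector between two target frames with either a linear or an arctan-shaped weight. The exact arctan formula used in ARTM is not given in this chapter, so the normalized form shown here is an assumption, chosen only so that the weight runs from 0 at the start transition time to 1 at the end of the frame, with the steepest change at the transition point. The names `linear_weight`, `arctan_weight`, and `interpolate` are hypothetical.

```python
import math


def linear_weight(t, t_start, t_end):
    """Weight rising linearly from 0 at t_start to 1 at t_end (clamped)."""
    x = (t - t_start) / (t_end - t_start)
    return min(1.0, max(0.0, x))


def arctan_weight(t, t_start, t_end, t_point, rate):
    """Arctan-shaped weight, normalized to 0 at t_start and 1 at t_end.

    t_point is the transition point (time of steepest change) and rate is
    the transition rate in units per second. This normalized form is an
    assumption for illustration, not the formula from the thesis.
    """
    lo = math.atan(rate * (t_start - t_point))
    hi = math.atan(rate * (t_end - t_point))
    t = min(t_end, max(t_start, t))
    return (math.atan(rate * (t - t_point)) - lo) / (hi - lo)


def interpolate(current, nxt, w):
    """Blend two equal-length vectors (areas or articulatory parameters)."""
    return [(1.0 - w) * a + w * b for a, b in zip(current, nxt)]
```

With a hypothetical 0.1 s frame, a transition point at the midpoint, and a rate of 70 units per second, `arctan_weight` stays near 0 early in the frame, passes through 0.5 at the midpoint, and approaches 1 toward the end, whereas `linear_weight` changes at a constant rate throughout.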

Figure  4-5  shows  the  messages  and  waveforms  when  synthesizing  the  sentence 
“We  were  away  a year  ago.”  The  vocal  tract  cross-sectional  areas  are  interpolated  by  the 
arctan  function.  A 60  kHz  sampling  rate  and  60  sections  of  the  vocal  tract  are  used  in  this 
example. The monitored pressure and volume velocity waveforms are at sections 20, 30,
40,  and  50  of  the  vocal  tract.  The  Choices  popup  window  provides  the  user  with  a method 
for  specifying  the  articulatory  trajectories  that  are  to  be  displayed  during  the  synthesis  of 
the  speech. 


4.5  Summary 

An  articulatory  synthesis  software  program  called  ARTM  was  implemented  with 
devguide,  XView,  and  C functions.  The  formant  tracks  and  the  pitch  contour  were 
extracted  from  the  speech  signal  using  the  ESPS  and  Matlab  programs  in  the  analysis 
phase.  In  the  speech  inverse  filtering  phase,  several  target  frames  were  selected  based  on 
some  time-  and  frequency-domain  analysis.  The  articulatory  model  was  used  to  construct 
the  initial  configuration  of  the  vocal  tract.  Then  speech  inverse  filtering  was  performed 
with  the  simulated  annealing  algorithm  to  obtain  the  vocal  tract  cross-sectional  area. 
Based  on  the  pitch  contour  and  LF  parameters  obtained  in  the  analysis  phase,  the 
excitation waveform models were constructed. Several features, including jitter and
shimmer,  aspiration  noise,  turbulence  noise  source,  the  subglottal  system,  and  the  glottal 
area  model,  provide  various  characteristics  of  the  excitation  generation.  Finally,  the 
synthesis  phase  constructs  the  speech  signal.  The  user  interface  displays  various 
waveforms  during  the  synthesis  process.  The  synthesis  procedure  includes  two  means  for 
interpolating  vocal  tract  configurations,  a method  for  selecting  the  number  of  vocal  tract 
sections,  and  a means  for  specifying  the  synthesis  sampling  frequency. 


CHAPTER  5 
EXPERIMENTS 


To  verify  the  analysis  schemes  (derivation  of  the  articulatory  parameters  from  the 
speech  waveform)  and  to  validate  the  synthesis  schemes  (generation  of  the  acoustic 
speech  waveform  from  the  articulatory  parameters),  speech  tokens  were  synthesized  with 
our  articulatory  synthesis  tool.  The  speech  tokens  consisted  of  the  sentence  “We  were 
away  a year  ago.”  Several  experiments  were  conducted  to  study 

[1] the effect of spatial- and time-domain sampling (Experiment A),

[2] the effect of different interpolation schemes (Experiment B),

[3] the effects of different glottal area models and of different maximal opening
areas of the glottis (Experiment C),

[4] and the effect of different source spectral tilt specifications and waveform
shapes (Experiment D)

on  the  synthesis  quality.  Two  additional  experiments  were  conducted  to  study  the  effect 
of  sinus  cavities  and  opening  area  of  velopharyngeal  port  on  nasalized  vowels 
(Experiment  E)  and  to  verify  the  turbulence  noise  source  model  (Experiment  F). 

5.1  Experiment  A 

As  pointed  out  by  Wakita  and  Fant  (1978),  the  time-domain  simulation,  in 
particular  for  the  articulatory  synthesizer,  should  produce  a spectral  distortion  due  to  the 
frequency  warping  and  the  shifts  of  formants.  The  magnitude  of  the  spectral  distortion 
has  been  shown  to  depend  on  two  simulation  parameters,  the  number  of  vocal  tract 
sections  (spatial-domain  sampling)  and  the  sampling  frequency  (time-domain  sampling) 
(Maeda,  1982a).  Experiment  A was  conducted  to  study  the  effects  of  varying  these  two 
simulation parameters on the quality of the synthesized speech. The purpose was to
determine  appropriate  values  for  these  two  simulation  parameters  to  produce  synthetic 
speech  with  perceptually  insignificant  spectral  distortion. 

Let  SN  denote  the  number  of  vocal  tract  sections  and  Fs  denote  the  sampling 
frequency. Several variations of SN were used: 10, 20, 30, and 60 sections. The sampling
frequency,  Fs,  was  varied  from  10  kHz  to  60  kHz  with  10  kHz  steps.  In  total,  the 
combinations  of  SN  and  Fs  generated  24  synthetic  speech  signals.  A linear  interpolation 
of  the  vocal  tract  cross-sectional  area  was  used.  The  parameter  values  for  the  LF  source 
model  were  specified  as  tp=41%,  te=55%,  tc=58%,  and  ta=0.4%  of  the  pitch  period,  T0. 
These  parameter  values  were  adopted  from  Childers  and  Ahn  (1994). 

Figure  5-1  shows  the  original  and  synthetic  speech  signals  and  wideband 
spectrograms.  Only  one  synthetic  speech  signal  is  shown  in  this  figure  since  all  of  the 
synthetic  speech  signals  appear  similar.  Figure  5-1  (b)  shows  the  synthetic  speech  signal 
and  wideband  spectrogram  with  SN=60  sections  and  Fs=60  kHz,  the  highest  spatial-  and 
time-domain  resolution  provided  in  our  synthesis  tool.  It  can  be  seen  that  some  artifacts 
occur  during  the  transitions,  for  example,  the  second  and  fourth  formant  transitions  from 
/i/ to /w/ and the third formant transition from /g/ to /o/. The linear interpolation of the
vocal  tract  cross-sectional  area  may  cause  these  artifactual  formant  transitions.  This 
means  that  the  linear  interpolation  of  the  area  function  may  not  capture  the  formant 
dynamics  well.  A highly  resonant  fifth  formant  appeared  in  the  wideband  spectrogram  of 
the  synthetic  speech.  There  are  also  mismatches  between  the  bandwidths  and  intensities 
of  the  formants  of  the  synthetic  and  original  speech  signals.  Two  reasons  may  explain 
these  faults.  One  reason  is  that  the  assumptions  concerning  the  losses  in  the  vocal  tract 
model  may  not  match  the  actual  losses  for  the  speakers  used  in  this  study.  Another  reason 
is  that  the  optimization  scheme  does  not  determine  the  bandwidths  and  intensities  of  the 
formants.  Only  the  first  four  formants  are  determined.  This  is  apparent  in  Figure  5-1 
where an energy mismatch appears between the synthetic and original speech, resulting in
some obvious errors in the amplitude. The same fault has been reported in the Schroeter
and Sondhi (1994) study.

Figure  5-1  (c)  presents  the  wideband  spectrogram  of  synthetic  speech  with  SN=60 
sections  and  Fs=10  kHz.  The  frequency  warping  distortion  can  be  seen  above  2 kHz, 
resulting  in  an  audible  “ringing.”  In  Figure  5-1  (d),  the  frequency  warping  distortion  has 
been  reduced  between  2 kHz  and  3.5  kHz  by  increasing  the  sampling  frequency  from  10 
kHz  to  20  kHz.  Note  that  only  20  vocal  tract  sections  were  used  in  this  case.  Playback  of 
the  synthetic  speech  via  headphones  indicated  no  obvious  “ringing”  effect  when  Fs>20 
kHz.  In  section  2.4.2,  Chapter  2,  we  concluded  that  a ten-section  cross-sectional  area 
function  was  not  enough  to  represent  the  acoustic  characteristics  of  the  vocal  tract.  This 
conclusion  was  confirmed  by  listening  to  the  synthetic  speech  for  SN=10  sections  via 
headphones.  An  obvious  change  in  the  voice  characteristics  was  heard  due  to  the  dramatic 
shifts  of  the  formants  for  a ten-section  vocal  tract.  We  noted  that  the  shifts  in  the  formants 
exist  regardless  of  the  value  of  Fs. 
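
The dependence of the frequency warping on Fs can be illustrated with a small calculation. The sketch below assumes that the time-domain simulation behaves like a bilinear (trapezoidal-rule) discretization, for which a continuous-time resonance at frequency f appears at (Fs/pi) * arctan(pi * f / Fs). This section does not state which discretization ARTM uses, so both the mapping and the function names are assumptions for illustration only.

```python
import math


def warped_frequency(f_hz, fs_hz):
    """Apparent frequency of a resonance at f_hz after discretization.

    ASSUMPTION: bilinear (trapezoidal-rule) discretization at sampling
    rate fs_hz; the thesis section does not specify the scheme.
    """
    return (fs_hz / math.pi) * math.atan(math.pi * f_hz / fs_hz)


def warping_shift(f_hz, fs_hz):
    """Downward shift (Hz) of a resonance at f_hz caused by the warping."""
    return f_hz - warped_frequency(f_hz, fs_hz)
```

Under this assumption, evaluating `warping_shift(3000.0, fs)` for Fs = 10, 20, and 60 kHz gives a shift that shrinks steadily as Fs increases, which is consistent with the reduced distortion observed when the sampling frequency was raised.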

Thus,  the  higher  the  values  of  the  simulation  parameters,  SN  and  Fs,  the  lower  the 
spectral  distortion.  However,  the  amount  of  computation  is  proportional  to  the  product  of 
SN and Fs. In summary, SN=20 sections and Fs=20 kHz appear to produce synthetic
speech  with  a perceptually  insignificant  spectral  distortion. 
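
The trade-off between distortion and computation can be made concrete with a short sketch. It enumerates the SN and Fs values listed for Experiment A and expresses the computational load of each combination relative to the SN=20, Fs=20 kHz setting, using only the proportionality to the product SN x Fs stated above. The function name `relative_cost` is hypothetical.

```python
from itertools import product

SN_VALUES = [10, 20, 30, 60]              # number of vocal tract sections
FS_VALUES_KHZ = [10, 20, 30, 40, 50, 60]  # sampling frequencies in kHz


def relative_cost(sn, fs_khz, ref_sn=20, ref_fs_khz=20):
    """Computation relative to the reference setting, assuming the cost
    is proportional to the product SN * Fs."""
    return (sn * fs_khz) / (ref_sn * ref_fs_khz)


# All combinations synthesized in Experiment A.
combinations = list(product(SN_VALUES, FS_VALUES_KHZ))
```

Here `len(combinations)` is 24, matching the number of synthetic signals, and `relative_cost(60, 60)` is 9.0, so the highest-resolution setting costs about nine times as much as the SN=20, Fs=20 kHz setting.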

One  weakness  observed  in  the  articulatory  synthesis  was  an  “echoing”  effect, 
which  was  present  in  every  token,  even  for  SN=60  sections  and  Fs=60  kHz.  This 
phenomenon  is  perhaps  due  to  the  superposition  of  the  energy  of  the  current  excitation 
frame  with  the  residual  energy  from  the  previous  excitation  frame  due  to  the  memory  of 
the  synthesis  filter  (Childers,  1995).  In  the  formant  synthesizer,  a parallel  structure  of 
multi-filter banks was proposed to solve this problem (Lalwani and Childers, 1991).
However, in the digital time-domain articulatory synthesizer, this problem remains
unsolved. 


5.2  Experiment  B 

This  experiment  was  conducted  to  study  the  effect  of  different  interpolation 
functions  on  the  speech  synthesis  quality.  Figure  5-2  compares  the  wideband 
spectrograms  of  the  original  speech  and  of  the  synthetic  speech  for  different  interpolation 
schemes.  In  this  figure,  results  are  shown  for  the  original  speech  (Figure  5-2(a)),  the 
linear  interpolation  of  vocal  tract  cross-sectional  area  (Figure  5-2(b)),  the  arctan 
interpolation  of  vocal  tract  cross-sectional  area  (Figure  5-2(c)),  the  linear  interpolation  of 
articulatory  parameters  (Figure  5-2(d)),  and  the  arctan  interpolation  of  articulatory 
parameters  (Figure  5-2(e)).  The  parameter  values  for  the  arctan  interpolation  function, 
point of transition, t0, and transition rate, y, were specified at the midpoint of the frame
duration  and  70  units  per  second,  respectively.  In  this  experiment,  a 60-section  vocal  tract 
and  60  kHz  sampling  frequency  were  used.  The  same  LF  parameters  as  specified  in 
Experiment  A were  used. 

From  the  wideband  spectrograms  and  the  playback  of  the  synthetic  speech,  no 
clear  difference  was  found  for  the  linear  or  for  the  arctan  interpolation  of  vocal  tract 
cross-sectional  area  (Figure  5-2(b)  and  (c)).  However,  a problem  was  noted  with  the 
interpolation  of  the  articulatory  parameters  by  either  method.  In  certain  regions,  the 
interpolation  between  two  sets  of  articulatory  parameters  generated  short-duration, 
noise-like  speech.  This  results  in  a white-noise-like  spectrum,  seen  as  the  high  intensity 
vertical  bars  in  Figure  5-2(d)  and  (e).  Several  audible  click  artifacts  were  heard  from  the 
playback  of  the  synthesized  speech  via  headphones.  A similar  artifact  was  reported  by 
Gupta  and  Schroeter  (1993),  who  optimized  the  parameters  by  arctan  function  for  each 
articulator.  The  physiologically  possible  but  dynamically  unnatural  configurations 
obtained during the interpolation of articulatory parameters may contribute to these
artifacts. 


5.3  Experiment  C 

There  are  two  main  kinds  of  interactive  processes  during  phonation:  mechanical 
interaction,  whereby  the  glottal  flow  affects  the  vibrating  pattern  of  the  vocal  folds,  and 
acoustic  interaction,  whereby  the  waveform  shape  of  glottal  flow  is  affected  by  the  load  of 
the  vocal  tract  (Lin,  1990).  Since  the  glottal  source  waveform  was  modeled  by  the  LF 
model,  only  acoustic  interaction  was  considered  in  our  study.  The  acoustic  interaction 
effect  can  be  achieved  in  two  ways:  controlling  the  glottal  impedance  by  using  a glottal 
area  model  or  adjusting  the  shape  of  the  glottal  waveform  by  incorporating  an  equivalent 
effect  into  the  source  model  (Childers,  1995).  We  take  the  former  approach.  In  Chapter 
2,  section  2.3.5,  an  interactive  source  model  that  consists  of  the  unified  glottal  excitation 
model,  the  subglottal  model,  and  the  glottal  area  model  was  proposed.  Using  this 
proposed  interactive  model,  we  conducted  the  experiment  in  two  parts.  The  first  part 
studied the effect of different glottal area models, and the second part studied the effect of
different  maximal  opening  areas  of  the  glottis  on  the  synthesis  quality. 

In  the  first  part,  the  wideband  spectrograms  of  the  original  speech  and  of  the 
synthetic  speech  for  different  glottal  area  models  were  compared  (Figure  5-3).  As  a 
reference,  the  wideband  spectrogram  of  synthetic  speech  with  no  glottal  impedance  and 
no  subglottal  system  is  shown  in  Figure  5-3(b).  Figure  5-3(c),  (d),  and  (e)  show  the 
wideband  spectrograms  for  synthetic  speech  with  triangular,  sine,  and  raised-cosine 
glottal  area  models,  respectively.  Equal  opening  and  closing  durations  were  specified  for 
all  three  glottal  area  models  (refer  to  the  glottal  area  waveforms  in  Figure  2-17).  The 
maximal  opening  area  of  the  glottis  was  set  at  20  mm2  (0.2  cm2).  A 60-section  vocal 
tract  and  60  kHz  sampling  frequency  were  specified  in  this  experiment.  The  same  LF 
parameters  as  specified  for  Experiment  A were  used. 
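
The three glottal area shapes can be sketched as follows. The exact definitions are those of Figure 2-17, which is not reproduced in this chapter, so the formulas below are assumptions: each is a symmetric pulse (equal opening and closing durations) that is zero at both ends of the open phase and reaches the maximal opening area at its center. The function name `glottal_area` and its arguments are hypothetical.

```python
import math


def glottal_area(t, t_open, a_max=0.2, model="raised_cosine"):
    """Glottal area in cm^2 at time t within an open phase of length t_open.

    a_max is the maximal opening area (0.2 cm^2 = 20 mm^2 in this
    experiment). The area is zero outside the open phase. The pulse
    shapes are illustrative assumptions, not taken from Figure 2-17.
    """
    if t <= 0.0 or t >= t_open:
        return 0.0
    x = t / t_open  # normalized position in the open phase, 0 < x < 1
    if model == "triangular":
        return a_max * (1.0 - abs(2.0 * x - 1.0))
    if model == "sine":
        return a_max * math.sin(math.pi * x)
    if model == "raised_cosine":
        return a_max * 0.5 * (1.0 - math.cos(2.0 * math.pi * x))
    raise ValueError("unknown glottal area model: " + model)
```

All three shapes share the same peak and duration; they differ in how quickly the area grows near the opening and closing instants, which is where the time-varying glottal impedance changes most rapidly.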

Figure 5-3: Wideband spectrograms of the original speech and of the synthetic speech with different glottal area models.

An increase in the bandwidths of the formants in the interactive model (Figure
5-3(c),  (d),  and  (e))  is  apparent.  This  means  that  the  interactive  model  has  induced  more 
damping  of  the  formants  than  the  non-interactive  model  (Figure  5-3(b)).  This  is 
reasonable  because  the  vocal  tract  was  loaded  by  the  finite  glottal  impedance  in  the 
interactive  model.  This  result  is  in  agreement  with  Badin  and  Fant  (1984)  and  Lin’s 
(1990)  acoustic  analysis.  From  the  playback  of  the  synthetic  speech,  the  triangular  and 
raised-cosine glottal area models sound smoother and clearer than the sinusoidal
model.  One  reason  for  this  is  that  the  glottal  area  waveforms  of  the  triangular  and 
raised-cosine  models  are  closer  to  the  measured  waveforms  of  the  glottal  area  from  high 
speed  film  (refer  to  Figure  2-16). 

In our case, we found that the synthetic speech with the non-interactive model
sounds more natural than with the interactive model. Several factors may explain this
contradiction. It is well known that the LF model can account for one interaction effect,
namely, pulse-skewing induced by the inertive loading of the sub- and supraglottal
acoustic systems. In our proposed interactive model, this may result in an overskewing
effect in the glottal pulse. This is one of the weaknesses of the LF model, i.e., it is
difficult to include other major interaction effects using the LF model, particularly in
articulatory synthesis. No relationship between the time-varying glottal area and the
glottal pulse of the LF model was used in our interactive model. This may result in inferior
quality  for  the  synthetic  speech.  By  listening  to  the  synthetic  speech,  it  was  possible  to 
hear  that  the  voice  characteristics  of  some  parts  of  the  speech  were  changed  more  than 
other  parts.  Consequently,  certain  phonemes  may  require  larger  glottal  opening  areas  than 
other  phonemes. 

For the second part of this experiment, Figure 5-4 presents the wideband
spectrograms of the synthetic speech for different maximal opening areas of the glottis
when the raised-cosine glottal area model was used. Four values of the maximal opening
area of the glottis were specified for synthesis: 10, 15, 20, and 30 mm2. It can be seen
that the bandwidths of the formants increase with an increase of the maximal opening area
of the glottis. The speech sounded unnatural when the maximal opening area of the
glottis was set at 30 mm2. The subglottal formants may have some effect on the
synthetic speech when the maximal opening area of the glottis is large.

Figure 5-4: Wideband spectrograms of the synthetic speech with different maximal opening areas of the glottis when the raised-cosine model was used.

5.4  Experiment  D 

Recent  research  has  shown  that  the  characteristics  of  the  glottal  source  waveform, 
such  as  the  glottal  pulse  width,  glottal  pulse  skewness,  and  the  abruptness  of  glottal 
closure  are  important  for  speech  synthesis  (Childers,  1995;  Childers  and  Ahn,  1994; 
Childers  and  Hu,  1994;  Childers  and  Lee,  1991;  Fant,  1993;  Fant  et  al.,  1985;  Fujisaki  and 
Ljungqvist,  1986;  Klatt  and  Klatt,  1990).  Experiment  D was  conducted  in  two  parts.  The 
first part was to study the effect of the abruptness of glottal closure on the synthesis
quality. In the second part, the effect of different glottal waveforms on the synthesis
quality was studied.

Figure 5-5 presents the wideband spectrograms for four different degrees of abruptness of
glottal closure. In the LF model, different values of ta were specified since ta is closely
related  to  the  abruptness  of  glottal  closure  (Childers,  1995;  Childers  and  Ahn,  1994).  A 
variation  of  ta  also  causes  a variation  in  the  spectral  tilt  (Fant  et  al.,  1985;  Fant  and  Lin, 
1988).  As  ta  was  increased,  there  was  a reduction  of  the  intensities  of  high  frequencies  in 
the  spectrogram,  which  agrees  with  Childers  and  Ahn  (1994). 

In  the  second  part  of  this  experiment,  three  different  glottal  waveforms  were  used 
for  synthesis.  Figure  5-6  shows  the  differential  glottal  waveforms,  glottal  waveforms, 
glottal  source  power  spectra,  and  the  corresponding  LF  parameters.  Figure  5-7  presents 
the  wideband  spectrograms  of  the  synthetic  speech  signals  with  different  glottal  source 
waveforms.  The  speed  quotients,  defined  as  tp/(tc  - tp),  were  2.4  and  2.6  for  case  (a) 
and  case  (c),  respectively.  There  was  little  distinction  between  these  two  cases,  even 
though  the  open  quotients  (tc/T0)  were  0.58  and  0.90  and  the  pulse  widths  were  different. 
The synthetic speech for case (b) did sound slightly softer. This is due to the smaller
speed quotient and wider pulse width. This experiment did confirm that the quality of
synthetic speech can be adjusted by controlling the LF parameters.

Figure 5-5: Wideband spectrograms of the synthetic speech with different source spectral tilt specifications.

Figure 5-6: The differential glottal waveforms, glottal waveforms, and power spectral density (left to right) for different LF parameters.

Figure 5-7: The wideband spectrograms of different glottal waveforms.
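
The speed quotient and open quotient quoted in this section follow directly from the LF timing parameters. The minimal sketch below applies the definitions given in the text, tp/(tc - tp) and tc/T0, to the Experiment A values (tp=41% and tc=58% of T0); these appear to correspond to the 2.4 and 0.58 reported for case (a). The function names are hypothetical.

```python
def speed_quotient(tp, tc):
    """Speed quotient as defined in the text: tp / (tc - tp)."""
    return tp / (tc - tp)


def open_quotient(tc, t0=100.0):
    """Open quotient tc / T0; with timings in percent of T0, t0 is 100."""
    return tc / t0


# LF timing parameters from Experiment A, in percent of the pitch period T0.
TP, TC = 41.0, 58.0
sq = speed_quotient(TP, TC)  # 41 / 17, about 2.4
oq = open_quotient(TC)       # 58 / 100 = 0.58
```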

5.5 Experiment E

This  experiment  was  conducted  to  study  the  effect  of  the  sinus  cavities  and 
velopharyngeal  opening  area  on  the  quality  of  nasalized  vowels.  An  eleven-section  nasal 
tract  cross-sectional  area  was  adopted  from  Maeda  (1982b).  The  maxillary  sinus,  which 
was  tuned  to  resonate  at  500  Hz,  was  located  at  4 cm  from  the  nostrils.  The  synthetic 
speech  tokens  were  vowels  /i/  and  /a/.  They  were  synthesized  with  and  without  the 
maxillary  sinus  and  for  five  different  values  of  velopharyngeal  opening  (0.1,  0.2,  0.3, 
0.45,  0.6  cm2).  The  vocal  tract  cross-sectional  areas  of  both  vowels  were  adopted  from 
the  optimized  results  of  12  American  vowels  in  Appendix  A.  A 60-section  vocal  tract  and 
a 60  kHz  sampling  frequency  were  used  in  this  experiment.  The  same  LF  parameters  as 
specified  in  Experiment  A were  used. 

First, we considered the case with no sinus coupling. From the playback of the synthetic
speech via headphones, there is no obvious nasality for any of the five velopharyngeal
openings.  For  vowel  /i/,  only  the  voice  characteristics  changed  when  the  velopharyngeal 
opening  increased.  This  is  because  the  total  tract  length  is  increased  due  to  coupling  of 
the  pharyngeal  tract  and  the  nasal  tract.  On  the  other  hand,  the  vowel  /a/  has  a wide  oral 
tract  such  that  the  velopharyngeal  opening  does  not  affect  the  voice  characteristics,  except 
as  the  opening  becomes  large. 

Second,  we  considered  the  maxillary  sinus  coupling  case.  Nasality  is  apparent 
when  the  opening  of  the  velopharyngeal  port  is  more  than  0.2  cm2.  The  nasality  increases 
with  an  increase  in  the  opening  of  the  velopharyngeal  port.  Figure  5-8  and  Figure  5-9 
illustrate the waveforms and FFT spectra of the vowels /i/ and /a/, respectively, when they
were synthesized with the velopharyngeal port closed, with the velopharyngeal port open
(0.6 cm2) but no sinus coupling, and with the velopharyngeal port open (0.6 cm2) and
the maxillary sinus coupled. A paired resonance-antiresonance below 500 Hz can be
observed from the FFT spectra of both vowels, Figure 5-8(c) and Figure 5-9(c), when the
velopharyngeal opening is 0.6 cm2 and the maxillary sinus is coupled. We feel that this
low-frequency pole-zero pair is the main contributor to nasality. Note that the
antiresonances due to the coupling of the nasal tract do not result in nasality. Thus, we
concluded that the maxillary sinus coupling is the major factor in nasality and the
velopharyngeal port opening controls only the extent of nasality. This result is in
agreement  with  the  simulation  studies  of  Maeda  (1982b),  Fant  (1985),  and  Lin  (1990). 

5.6  Experiment  F 

Fricative  and  plosive  consonants  have  traditionally  posed  a problem  area  in 
modeling  and  synthesis  because  of  the  complex  nature  of  their  production  mechanisms 
and  lack  of  sufficient  articulatory  and  aerodynamic  data  for  these  sounds.  The  dynamic 
and  time-varying  characteristics  and  the  complex  nature  of  their  production  have  made 
plosives  the  most  difficult  phonemes  to  model  and  synthesize.  Thus,  only  fricatives  are 
considered  in  this  experiment.  Fricatives  are  produced  by  the  formation  of  a narrow 
supraglottal  constriction  and  the  generation  of  turbulence  at  the  vicinity  of  this 
constriction.  Obtaining  articulatory  data  for  fricatives  via  X-ray  is  difficult  due  to  the 
narrow  constriction.  Estimating  such  data  from  speech  is  only  in  a preliminary  stage 
(Stevens, 1993a, 1993b; Sorokin, 1994). Thus, a vocal tract cross-sectional area (Figure
5-10), estimated and refined from the radiographic measurements of a female subject, for
the fricative /ʃ/ was adopted from Badin (1991) for this experiment. Modeling the
complex turbulent phenomena is another challenging problem. In this experiment, the
fricative /ʃ/ was synthesized with the turbulence noise source 1) located at the center of
the constriction region, 2) located immediately downstream from it, 3) located upstream
from it, and 4) spatially distributed along it. The model presented in Figure 2-21(b) was
used to simulate the turbulence noise source. The turbulence gain and critical Reynolds
number were specified at 0.00000002 and 2700, respectively. The glottal volume velocity
was assumed to be a DC source and set at 1000 cm3/sec.

Figure 5-10: A female vocal tract cross-sectional area of fricative /ʃ/ (data adopted from Badin (1991)).

Figure 5-11: Power spectral density of synthetic and real fricative /ʃ/.

Figure 5-11 compares the power spectral density of the synthetic and real speech
for the fricative /ʃ/ for a female speaker. The real speech was from another female
subject. A sixth-order LPC model was used to analyze a segment (300 samples) of the speech
signal. It can be seen that the spectral characteristics for the noise source located at
the center of and upstream from the constriction region are similar to each other but differ
from those for the other two locations. The resonant peak near 1800 Hz is due to the resonance of the large,
long back cavity. There is no prominent resonance when the turbulence noise source is
distributed along the constriction region. The frequency of the highest resonant peak for
the downstream case (2800 Hz) is lower than that for the real speech case (3100 Hz). We feel
that this difference may be due to differences in the vocal tract dimensions and
the location of the constriction between the two subjects. The synthetic speech of the downstream
case has a second resonance near 1800 Hz. This means that the back cavity has an effect
on  the  synthetic  speech.  The  difference  in  the  high  frequency  area  between  these  two 
cases,  downstream  and  real,  may  be  attributed  to  the  source  characteristics  of  the 
turbulence  noise. 
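
The LPC-based spectral estimate used for this comparison can be sketched as follows. This
is a generic autocorrelation-method implementation with the Levinson-Durbin recursion,
not the analysis code of this study; the 10 kHz sampling frequency, the Hamming window,
and the random test segment are assumptions chosen only to make the example runnable.

```python
import numpy as np


def lpc_autocorrelation(x, order):
    """LPC by the autocorrelation method (Levinson-Durbin recursion).

    Returns (a, err): a[0] = 1 and A(z) = sum_k a[k] z^-k is the inverse filter;
    err is the final prediction-error energy.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current predictor and autocorrelation.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_psd_db(a, err, fs, nfft=1024):
    """All-pole power spectrum err / |A(e^jw)|^2 in dB, with its frequency axis in Hz."""
    spectrum = np.fft.rfft(a, nfft)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, 10.0 * np.log10(err / np.abs(spectrum) ** 2)


fs = 10000.0                              # assumed sampling frequency in Hz
rng = np.random.default_rng(0)
segment = rng.standard_normal(300)        # stand-in for a 300-sample fricative segment
a, err = lpc_autocorrelation(segment * np.hamming(300), order=6)
freqs, psd_db = lpc_psd_db(a, err, fs)
```

A 6th-order model can represent at most three resonant peaks, which is why only the
broad spectral shape of the fricative is compared.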

Schroeter  and  Sondhi  (1994)  concluded  that  the  synthesis  of  fricatives  in  the 
articulatory  synthesizer  is  not  yet  satisfactory.  However,  from  our  experiment,  the 
turbulence  noise  source  location  was  found  to  have  important  acoustic  consequences.  The 
downstream  case  seems  to  be  able  to  generate  spectral  characteristics  close  to  the  real 
speech case. The major problem for synthesis of fricatives lies in the estimation of the
relevant parameters from the acoustic speech signal, such as inferring articulatory and
source information.

5.8 Summary

Several  experiments  were  conducted  using  our  articulatory  synthesis  tool  in  this 
chapter.  The  study  of  the  effects  of  the  spatial-  and  time-domain  sampling  on  the 
synthesis quality has shown that a 20-section vocal tract and a 20 kHz sampling frequency
are the minimal requirements to synthesize speech with a perceptually insignificant spectral
distortion.  Four  different  interpolation  schemes  were  compared.  The  synthesis  quality  of 
the  interpolation  of  vocal  tract  cross-sectional  area  was  superior  to  the  interpolation  of 
articulatory  parameters,  because  interpolation  of  articulatory  parameters  produced  audible 
clicks  in  the  synthesized  speech.  In  Experiment  C,  we  confirmed  that  the  glottal 
impedance  and  the  subglottal  system  did  affect  the  synthesis  quality.  The  voice 
characteristics  can  be  adjusted  by  controlling  the  LF  model  parameters,  as  shown  in 
Experiment  D.  In  Experiment  E,  we  concluded  that  the  maxillary  sinus  coupling  is  the 
major  factor  contributing  to  nasality  and  the  velopharyngeal  port  opening  controls  only 
the extent of nasality. The turbulence noise source was modeled in Experiment F. From
this experiment, the synthesis of fricatives in the articulatory synthesizer was shown to
be not yet satisfactory, while the location of the turbulence noise source was found to
have important acoustic consequences.

A common fault was noted in all synthetic speech produced by the time-domain
articulatory synthesis procedure: an echo was audible in the synthetic speech. We feel
that this fault was caused by the superposition of the energy of one excitation frame with
the residual energy of a previous excitation frame, which in turn is due to the memory of
the synthesis filter (Childers, 1995). This problem remains
unsolved  in  digital  time-domain  articulatory  synthesis.  However,  overall,  the  quality  of 
the  synthetic  speech  produced  by  our  articulatory  synthesis  tool  is  good.  This  indicates 
that  the  articulatory  synthesis  tool  effectively  identifies  and  simulates  the  human  vocal 
system. 


CHAPTER  6 

CONCLUSIONS  AND  RESEARCH  EXTENSIONS 


6.1  Summary 

The  primary  goal  of  this  research  was  to  obtain  one  solution  to  the  speech  inverse 
filtering  problem  and  to  develop  a flexible  and  high  quality  articulatory  synthesis  tool. 
The  approach  adopted  is  based  on  a frequency-  and  time-domain  analysis  and  a 
time-domain  articulatory  synthesis  strategy.  A software  program  called  ARTM  was 
implemented  as  an  articulatory  synthesis  tool.  One  major  feature  of  this  research  tool  is 
the  simulated  annealing  optimization  procedure  that  is  used  to  optimize  the  vocal  tract 
parameters  to  match  a specified  set  of  formant  characteristics.  Another  feature  is  the  use 
of  a newly  derived  set  of  acoustic  equations  that  include  the  vocal  tract,  the  subglottal 
system,  the  glottal  impedance,  the  excitation  source,  the  turbulence  noise  source,  and  the 
nasal  tract  with  sinus  cavities  for  the  articulatory  synthesizer. 

6.1.1  Articulatory  Model  Implementation 

The  articulatory  model  was  implemented  to  transform  the  articulatory  parameters 
to  a vector  representation  of  the  vocal  tract  cross-sectional  area,  and  from  there,  to  the 
acoustic  characteristics  of  the  vocal  tract.  Our  articulatory  model  is  based  on  the 
Mermelstein  model  (1973)  with  several  modifications  for  the  lower  part  of  the  pharynx, 
the  hyoid  region,  and  the  tongue-tip-to-jaw  region.  The  vocal  tract  is  represented  by  as 
many  as  60  sections.  The  sagittal  grid  lines  are  oriented  according  to  the  position  of  the 
articulators  to  provide  more  reliable  estimates  of  the  vocal  tract  cross-sectional  areas.  By 
using  XView,  devguide,  and  C functions,  the  model  has  been  designed  with  special 
interfaces  that  provide  for  the  numerical  specification  of  parameters  as  well  as  sliding  bar 
capabilities that allow parameter adjustments. Such a feature has made the setup of the
initial  vocal  tract  configuration  for  the  optimization  scheme  easy,  fast,  and  flexible. 

6.1.2  Acoustic  Model  Realization 

A transmission-line  circuit  model  of  the  vocal  system,  which  includes  the  vocal 
tract,  the  nasal  tract  with  sinus  cavities,  the  glottal  impedance,  the  subglottal  tract,  the 
excitation  source,  and  the  turbulence  noise  source,  was  constructed.  The  acoustic  model 
of  each  subsystem  of  the  vocal  system  was  analyzed. 

The vocal tract was approximated by a non-uniform, lossy, soft-walled, straight tube
with 60 concatenated elemental sections (circular or elliptic). The transmission-line
analogy  approach  was  used  to  model  the  vocal  tract  as  an  equivalent  circuit  network.  A 
series  resistor  represents  the  viscous  loss  and  a shunt  conductance  represents  the  thermal 
loss.  The  yielding  wall  vibration  loss  was  modeled  by  a shunt  impedance.  The  effect  of 
the  sinus  cavities  on  the  nasal  consonants  and  nasalized  vowels  was  discussed.  The  sinus 
cavity  was  regarded  as  a Helmholtz  resonator  and  was  modeled  as  a shunt  impedance. 
Radiation  models  were  discussed.  Flanagan’s  model  (1972)  was  considered  the  most 
appropriate  model  for  the  time-domain  articulatory  synthesis. 
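
As an illustration of the Helmholtz resonator treatment of a sinus cavity, the sketch
below computes the acoustic mass of the neck, the compliance of the cavity, the resulting
resonance frequency, and the series R-L-C shunt impedance. The air constants, the loss
resistance, and the cavity dimensions are hypothetical values chosen for the example and
are not measurements from this study.

```python
import math

RHO = 1.14e-3      # air density in g/cm^3 (assumed)
C_SOUND = 3.5e4    # speed of sound in cm/s (assumed)


def helmholtz_elements(volume, neck_area, neck_length):
    """Acoustic mass of the neck and acoustic compliance of the cavity (cgs units)."""
    mass = RHO * neck_length / neck_area
    compliance = volume / (RHO * C_SOUND ** 2)
    return mass, compliance


def helmholtz_frequency(volume, neck_area, neck_length):
    """Resonance frequency in Hz: 1 / (2 pi sqrt(mass * compliance))."""
    mass, compliance = helmholtz_elements(volume, neck_area, neck_length)
    return 1.0 / (2.0 * math.pi * math.sqrt(mass * compliance))


def sinus_shunt_impedance(freq, volume, neck_area, neck_length, resistance=1.0):
    """Series R-L-C impedance of the resonator, used as a shunt branch of the tract."""
    mass, compliance = helmholtz_elements(volume, neck_area, neck_length)
    w = 2.0 * math.pi * freq
    return complex(resistance, w * mass - 1.0 / (w * compliance))


# Hypothetical sinus: 20 cm^3 cavity, 0.1 cm^2 neck area, 0.5 cm neck length.
f0 = helmholtz_frequency(20.0, 0.1, 0.5)
print(f0, sinus_shunt_impedance(f0, 20.0, 0.1, 0.5))
```

At the resonance frequency the reactance vanishes and the shunt branch draws the most
volume velocity from the nasal tract, which introduces a zero in the transfer function.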

For  the  non-interactive  excitation  source,  we  simplified  the  unified  glottal 
excitation  model  (Lalwani  and  Childers,  1991)  that  includes  the  jitter  model  and  shimmer 
model  into  the  LF  model.  For  the  interactive  excitation  source,  we  proposed  a new 
model,  which  consists  of  the  unified  glottal  excitation  model,  the  subglottal  model,  and 
the  glottal  area  model.  The  subglottal  system  was  modeled  by  three  cascaded  RLC  Foster 
circuits  (Ananthapadmanabha  and  Fant,  1982).  The  triangular,  sine,  and  raised-cosine 
functions  were  used  as  options  to  model  the  time-varying  glottal  area  function 
(Ananthapadmanabha  and  Fant,  1982). 

For  the  turbulence  noise  source  model,  the  distributed  and  series  pressure  noise 
source  model  (Flanagan  and  Cherry,  1968)  and  the  downstream  parallel  flow  source 
model (Sondhi and Schroeter, 1986, 1987) were discussed. The parallel flow source
model  was  adopted  for  this  study.  The  turbulence  noise  source  can  be  located  1)  at  the 
center  of,  2)  immediately  downstream  from,  3)  upstream  from,  and  4)  spatially  distributed 
along  the  constriction  region. 

We  have  also  analyzed  the  effects  of  various  characteristics  of  the  vocal  system  on 
the  acoustic  transfer  function.  These  characteristics  include  the  frequency-dependent 
components  simulated  at  fixed  frequency,  the  number  of  vocal  tract  sections,  the  nasal 
tract  and  sinus  cavities,  the  glottal  impedance  and  subglottal  system,  and  the  excitation  in 
the  vocal  tract.  Such  an  analysis  provided  a basis  for  choosing  appropriate  parameters  for 
the  articulatory  synthesizer. 

6.1.3  Articulatory  Synthesizer  Implementation 

A practical  articulatory  synthesizer  was  proposed  that  included  the  vocal  tract,  the 
nasal  tract  with  sinus  cavities,  the  glottal  impedance,  the  subglottal  system,  the  excitation 
source,  and  the  turbulence  noise  source.  The  acoustic  equations  of  the  vocal  system  were 
derived  for  the  proposed  articulatory  synthesizer.  The  time-domain  approach  was  used  to 
simulate  the  dynamic  properties  of  the  vocal  system  as  well  as  to  improve  the  quality  of 
the  synthesized  speech.  The  vocal  tract  cross-sectional  area  or  the  articulatory  parameters 
were  interpolated  between  two  consecutive  target  frames  using  a linear  or  arctan  function. 
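
The two interpolation options can be sketched as below. The normalization of the arctan
weight (so that it runs exactly from 0 to 1 between the two target frames) and the
parameter names `rate` and `center` are assumptions made for this example; they stand for
the transition rate and the point of transition and are not taken from the ARTM source.

```python
import numpy as np


def linear_weight(t):
    """Linear transition weight for normalized time t in [0, 1]."""
    return np.asarray(t, dtype=float)


def arctan_weight(t, rate=8.0, center=0.5):
    """Arctan transition weight, rescaled so that weight(0) = 0 and weight(1) = 1."""
    t = np.asarray(t, dtype=float)
    lo = np.arctan(rate * (0.0 - center))
    hi = np.arctan(rate * (1.0 - center))
    return (np.arctan(rate * (t - center)) - lo) / (hi - lo)


def interpolate_frames(frame_a, frame_b, n_steps, weight=linear_weight, **kwargs):
    """Interpolate between two target frames (area functions or articulatory vectors).

    Returns an array of shape (n_steps, len(frame_a)); row 0 is frame_a and the
    last row is frame_b.
    """
    frame_a = np.asarray(frame_a, dtype=float)
    frame_b = np.asarray(frame_b, dtype=float)
    t = np.linspace(0.0, 1.0, n_steps)
    w = weight(t, **kwargs)[:, None]
    return (1.0 - w) * frame_a + w * frame_b


# Hypothetical 4-section area functions (cm^2) for two consecutive target frames.
area_a = [2.0, 3.0, 1.0, 4.0]
area_b = [1.0, 0.5, 3.0, 2.0]
linear_path = interpolate_frames(area_a, area_b, 11)
arctan_path = interpolate_frames(area_a, area_b, 11, weight=arctan_weight, rate=8.0)
```

The arctan weight keeps the frames near each target for longer and concentrates the
change around the transition point, whereas the linear weight changes at a constant rate.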

6.1.4  Speech  Inverse  Filtering 

A major  impediment  to  the  use  of  the  articulatory  synthesizer  has  been  the  lack  of 
a robust  algorithm  to  derive  articulatory  configurations  from  the  speech  signal  using 
speech  inverse  filtering.  Most  speech  inverse  filtering  procedures  using  conventional 
optimization  algorithms  have  a high  computational  burden,  are  easily  trapped  by  local 
minima,  and  are  unable  to  solve  the  non-uniqueness  problem.  The  simulated  annealing 
algorithm  was  introduced  to  solve  the  speech  inverse  filtering  problem.  Our  approach  is 
based  on  the  Corana  et  al.  (1987)  algorithm.  This  algorithm  is  sufficiently  constrained  to 
avoid the non-uniqueness problem and the local minima problems. The constraints in the
present  work  were  provided  by  the  articulatory-to-acoustic  transformation  function  and 
the  boundary  conditions  for  the  articulatory  parameters.  The  articulatory  vector  defines 
the  set  of  parameters  to  be  optimized.  The  cost  function  is  a percentage  of  the  weighted 
least-absolute-value error distance. It compares the first four formant frequencies of the
model-generated frame with those of the target frame (from speech analysis). A
1%  error  criterion  was  determined  to  be  sufficient  to  generate  natural  vocal  tract  shapes 
and  yet  be  practical  computationally.  Once  the  optimum  articulatory  vector  is  obtained, 
the  articulatory  model  determines  the  vocal  tract  cross-sectional  area  function,  which  in 
turn  is  used  by  the  articulatory  speech  synthesizer. 
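
The optimization loop described above can be illustrated with a deliberately simplified
sketch. It is not the Corana et al. (1987) algorithm, which also adapts the step size of
each parameter; it is a plain Metropolis-style annealer. The exact form of the cost
(weighted mean absolute relative error in percent), the uniform weights, the annealing
constants, and the one-parameter uniform-tube model standing in for the
articulatory-to-acoustic transformation are all assumptions made for the example.

```python
import math
import random

C_SOUND = 35000.0  # speed of sound in cm/s (assumed)


def formant_cost(model_formants, target_formants, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted mean absolute relative formant error, in percent (assumed form)."""
    total = sum(w * abs(fm - ft) / ft
                for w, fm, ft in zip(weights, model_formants, target_formants))
    return 100.0 * total / sum(weights)


def toy_formants(params):
    """Stand-in model: first four resonances of a uniform closed-open tube of length params[0] cm."""
    length = params[0]
    return [(2 * n - 1) * C_SOUND / (4.0 * length) for n in range(1, 5)]


def anneal(target, start, bounds, rng, t0=10.0, cooling=0.95,
           steps_per_temp=50, max_iter=20000, tol=1.0):
    """Simplified simulated annealing; stops once the best cost is within tol percent."""
    current = list(start)
    cur_cost = formant_cost(toy_formants(current), target)
    best, best_cost = list(current), cur_cost
    temp = t0
    for it in range(max_iter):
        if best_cost <= tol:
            break
        i = rng.randrange(len(current))
        lo, hi = bounds[i]
        cand = list(current)
        cand[i] += rng.uniform(-1.0, 1.0) * 0.1 * (hi - lo)
        if lo <= cand[i] <= hi:  # boundary condition: reject out-of-range proposals
            cost = formant_cost(toy_formants(cand), target)
            # Metropolis rule: always accept improvements, sometimes accept worse points.
            if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
                current, cur_cost = cand, cost
                if cost < best_cost:
                    best, best_cost = list(cand), cost
        if (it + 1) % steps_per_temp == 0:
            temp *= cooling
    return best, best_cost


target = toy_formants([17.5])   # formants of a hypothetical 17.5 cm tube
best, best_cost = anneal(target, [12.0], [(10.0, 20.0)], random.Random(0))
print(best, best_cost)
```

In the actual system the stand-in model would be replaced by the articulatory model
followed by the acoustic transfer function calculation, and the parameter vector would
hold the articulatory parameters with their boundary conditions.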

Results  of  speech  inverse  filtering  for  twelve  typical  American  vowels  and  two 
speech  sentences  were  presented.  Default  annealing  parameters  that  control  the  simulated 
annealing  algorithm  were  given.  The  simulated  annealing  algorithm  has  proven  to  be 
efficient  and  flexible  in  dealing  with  the  problems  that  are  inherent  to  speech  inverse 
filtering.  However,  the  selection  of  parameters  for  the  annealing  schedule  is  an  obstacle 
for  the  simulated  annealing  algorithm,  since  we  know  little  about  the  relation  between  the 
argument  domain  (articulatory  vector)  and  the  technology  (the  algorithm).  The  guideline 
in  Appendix  D was  found  to  provide  an  effective  procedure  to  attack  the  speech  inverse 
filtering  problem  using  the  simulated  annealing  algorithm. 

6.1.5  Articulatory  Synthesis  Software  System 

A software  program  called  ARTM  was  implemented  with  devguide,  XView, 
and  C functions.  This  software  system  contains  several  phases.  The  analysis  phase 
extracts  the  formant  tracks  and  the  pitch  contour  from  the  speech  signal.  The  speech 
inverse  filtering  phase  first  selects  the  target  frames  from  the  formant  tracks,  constructs 
the  initial  configuration  of  the  vocal  tract  for  each  target  frame,  and  then  performs  the 
optimization  procedure  to  obtain  the  optimal  vocal  tract  cross-sectional  area.  The 
excitation phase constructs the excitation waveform model from the pitch contour.
Finally,  the  synthesis  phase  synthesizes  speech  using  the  vocal  tract  cross-sectional  area 
and  the  excitation  waveform  as  the  input. 

6.1.6  Experiments 

Several  experiments  were  conducted  by  changing  various  parameters.  Overall,  the 
quality  of  the  synthetic  speech  produced  by  our  articulatory  synthesis  tool  is  good.  This 
indicates  that  the  articulatory  synthesis  tool  effectively  identifies  and  simulates  the  human 
vocal  system.  However,  a common  fault  was  found  in  all  synthetic  speech  using  a 
time-domain  articulatory  synthesis  procedure.  This  fault  is  an  echo  in  the  synthesized 
speech  and  is  due  to  the  superposition  of  the  energy  of  a current  excitation  frame  with  the 
residual  energy  of  previous  excitation  frames. 

The  study  of  the  effects  of  the  spatial-  and  time-domain  sampling  on  the  synthesis 
quality  showed  that  a 20-section  vocal  tract  and  a 20  kHz  sampling  frequency  are  a 
minimal  requirement  to  synthesize  speech  with  a perceptually  insignificant  spectral 
distortion.  The  synthesis  quality  of  the  interpolation  of  vocal  tract  cross-sectional  area 
was  better  than  the  interpolation  of  articulatory  parameters.  The  glottal  impedance  and 
the  subglottal  system  were  found  to  affect  the  synthesis  quality.  We  showed  that  the  voice 
characteristics  can  be  adjusted  by  controlling  the  LF  parameters.  In  the  study  of  nasalized 
vowels,  we  concluded  that  the  maxillary  sinus  coupling  is  the  major  factor  for  nasality, 
while the velopharyngeal port opening controls only the extent of nasality. The
synthesis of fricatives in the articulatory synthesizer was shown to be not yet satisfactory.
However,  the  turbulence  noise  source  location  was  found  to  have  important  acoustic 
consequences. 


6.2  Extended  Research 

Although  we  have  successfully  introduced  a new  solution  for  the  speech  inverse 
filtering  problem  and  developed  a flexible  and  high  quality  articulatory  synthesis  tool, 
much remains to be further investigated, such as: 1) optimization with formant
frequencies and bandwidths, 2) speech inverse filtering for nasals, fricatives, and plosives,
3) improvement of the turbulence noise source model, 4) glottal inverse filtering and LF
model parameters extraction, 5) interpolation of the articulatory trajectories, and 6) a
source model for excitation relocation.

6.2.1  Optimization  with  Formant  Frequencies  and  Bandwidths 

The  formant  frequencies  of  natural  speech  are  often  closely  reproduced  in  the 
synthesis  but,  as  shown  by  wideband  spectrograms  in  Chapter  5,  the  bandwidths  and 
intensities  of  the  formants  are  often  greatly  mismatched.  Parthasarathy  and  Coker  (1992) 
have  shown  that  bandwidth  optimization  has  improved  spectral  matches  and  made  the 
synthesis  quality  much  closer  to  the  natural  speech.  Thus,  adding  the  bandwidths  of  the 
formants,  or  even  the  amplitudes  of  the  formants,  as  additional  acoustic  features  seems  to 
be  necessary. 

6.2.2 Speech Inverse Filtering for Nasals, Fricatives, and Plosives

Successful solution of the speech inverse filtering problem opens the way to the
development of articulatory-based speech synthesis, recognition, and coding. Our speech
inverse  filtering  procedure  was  devoted  to  vowels,  semivowels,  and  diphthongs.  Little 
has  been  done  for  nasals,  fricatives,  and  plosives  in  the  literature.  To  determine  the  vocal 
tract  shape  for  nasals,  an  ARMA  model  may  be  needed  to  analyze  the  complex  pole-zero 
patterns  of  natural  speech.  Additional  acoustic  features,  for  example,  the  antiresonant 
frequencies  and  bandwidths,  may  need  to  be  considered.  For  fricatives,  the  frequencies 
and  bandwidths  of  one  to  three  peaks  in  the  spectra  may  be  used  as  acoustic  features,  since 
fricatives  have  a segment  with  a quasi-stationary  spectrum  (Sorokin,  1994).  Positive 
results  have  been  obtained  by  Sorokin  (1994)  with  an  analysis-by-synthesis  approach 
based  on  minimal  muscle  work  criterion.  For  plosives,  the  characteristics  of  the  short 
segments  of  speech  for  these  phonemes  have  made  the  speech  inverse  filtering  very 
difficult, if not impossible. The detailed measurements of the changing spectra of
frication  and  aspiration  noise  at  the  release  of  a stop  consonant  may  make  it  possible  to 
infer  articulatory  and  laryngeal  configurations  and  movements  (Stevens,  1993a,  1993b). 

6.2.3  Improvement  of  the  Turbulence  Noise  Source  Model 

The  most  difficult  sounds  to  model  and  synthesize  are  fricatives  and  plosives;  this 
is  mainly  due  to  the  lack  of  sufficient  articulatory  and  aerodynamic  data  for  these  sounds. 
Due  to  the  inadequate  source  and  filter  models,  the  synthesis  of  fricatives  and  plosives  in 
the  articulatory  synthesizer  has  not  been  satisfactory  (Schroeter  and  Sondhi,  1994). 
Aerodynamic  results  garnered  from  the  theory  and  experiments  on  spoiler  generated  noise 
in  ducts  may  suggest  a way  for  the  turbulence  noise  source  model  (Stevens,  1971).  Work 
from  Shadle  (1991),  the  acoustics  of  fricative  consonants  based  on  mechanical  models  of 
the  vocal  tract,  may  offer  another  way  to  model  the  turbulence  noise  source  spectra. 

6.2.4  Glottal  Inverse  Filtering  and  LF  Model  Parameters  Extraction 

The  two-pass  method  (Childers  and  Lee,  1991)  was  used  in  this  study  for  glottal 
inverse  filtering.  Positive  results  have  been  obtained  for  sustained  vowels,  but  further 
development  is  needed  for  connected  speech  (Childers  and  Lee,  1991).  The  extraction  of 
the  LF  model  parameters  to  a large  extent  depends  on  the  accuracy  of  the  glottal  inverse 
filtering. Most currently available methods for estimating glottal waveforms are
restrictive in one way or another. This is due to errors in the formant locations and
bandwidths, and, especially, to the spurious poles generated by linear prediction techniques.
An interactive computer program that permits manipulation of frequencies and
bandwidths, i.e., a time-varying filter that is manually adapted to the speech waveform,
has been shown to be satisfactory for glottal inverse filtering of connected speech (Gobl, 1988).
Another  method,  called  Iterative  Adaptive  Inverse  Filtering  (IAIF),  has  also  shown  that 
quite  reliable  results  can  be  obtained  for  most  of  the  speech  (Alku,  1992).  Both  methods 
may  contribute  a solution  for  accurate  extraction  of  LF  model  parameters. 

6.2.5 Interpolation of the Articulatory Trajectories

It is well known that, in certain transitions, some articulators move more rapidly than
others; for example, the tongue tip position has to change suddenly during the transition
from phoneme /ʌ/ to phoneme /r/. According to the preliminary study of Heike (1979),
there is a constant relationship between articulatory variables and formants. He indicated
that, in an articulatory-model-based synthesis, non-linear articulator movements should be
used to produce the non-linear formant movements of closure gestures. Thus, a non-linear
interpolation scheme for the articulatory parameters may result in an improvement in the
quality and naturalness of synthetic speech. Though we provided two interpolation
functions, linear and arctan, to interpolate the vocal tract cross-sectional area and the
articulatory parameters, more sophisticated methods are required. Here, several
approaches are recommended. First, two controlling parameters of the arctan function,
the transition rate and the point of transition, might need to be specified or optimized for
each articulatory parameter to better describe the movement of each articulator. Gupta
and Schroeter (1993) have shown that optimizing the parameters of the arctan function for
each articulator makes it possible to match the dynamic behavior of different
articulators. Second, based on the assumption that the articulator motion is generated by a
neuro-muscular position servo, Parthasarathy and Coker (1992) proposed a new
interpolation function. The function has three active regions: a parabolic acceleration
region, a constant velocity region, and a parabolic deceleration region. A good match of
the formant tracks between natural and synthetic speech was obtained with optimized
interpolation parameters for each articulator. Finally, the task-dynamic model (Saltzman
and Kelso, 1987) may be used to produce articulatory trajectories. The task-dynamic model
uses a set of dynamic parameters and geometric transformations to simulate the movements
of the articulators and to match the formant trajectories. Browman and Goldstein (1989,
1990) have applied this model to analyze phonological structure, although the adequacy of
the task-dynamic model for describing the movement of articulators remains a
research question (McGowan, 1994).

6.2.6  Source  Model  for  Excitation  Relocation 

The  calculation  of  the  acoustic  transfer  function  of  excitation  relocation  may  open 
a new field of speech prosthesis for vocally handicapped people. As we discussed in
section  2.4.5,  Chapter  2,  different  phonemes  have  different  acoustic  transfer  functions, 
due  to  the  different  vocal  tract  shapes.  This  makes  the  modeling  of  the  excitation 
waveform  inside  the  vocal  tract  difficult,  if  not  impossible.  One  possible  way  to  generate 
the  excitation  waveform  inside  the  vocal  tract  is  to  prefilter  the  glottal  pulse  with  the 
inverse  filter  of  the  acoustic  transfer  function  of  excitation  relocation. 

Figure 6-1 shows a possible procedure to solve the modeling of the excitation
waveform inside the vocal tract. For a specific phoneme, both the acoustic transfer
function of normal excitation, denoted as $H_{normal}$ in Figure 6-1(a), and that of the
excitation relocation, denoted as $H_{exreloc}$ in Figure 6-1(b), can be easily calculated.
If it can be shown that $H_{exreloc}$ is invertible, i.e., minimum phase, a modified
acoustic transfer function, denoted as $H_{modified}$ in Figure 6-1(c), can be computed.
Then the excitation waveform inside the vocal tract would be the glottal waveform
prefiltered by the modified acoustic transfer function.
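
One way to read this procedure, stated here as an assumption rather than as the content
of Figure 6-1, is that the relocated excitation should produce the same output as the
glottal excitation. Writing $U_g$ for the glottal volume velocity and $U_x$ for the
excitation inside the tract (both symbols are introduced only for this illustration),
the requirement suggests:

```latex
\begin{aligned}
H_{normal}(\omega)\,U_g(\omega) &= H_{exreloc}(\omega)\,U_x(\omega) \\
U_x(\omega) &= \frac{H_{normal}(\omega)}{H_{exreloc}(\omega)}\,U_g(\omega)
             = H_{modified}(\omega)\,U_g(\omega)
\end{aligned}
```

Under this reading, the prefilter is stable and causal only if the inverse of the
excitation relocation transfer function is, which is why the minimum phase condition
matters.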

Figure 6-1: The procedure to model the excitation inside the tract excitation.


APPENDIX  A 

A COLLECTION  OF  FEATURES  FOR  TYPICAL  AMERICAN  VOWELS 


There  are  twelve  principal  vowels  in  American  English,  each  with  a different  set 
of  acoustic  characteristics  that  depend  on  the  positions  of  the  jaw,  the  tongue,  and  the  lips. 
Collected in this appendix, Figure A-1 to Figure A-12, are the approximated articulatory
configurations  (midsagittal  vocal  tract  outlines),  typical  acoustic  waveforms,  vocal  tract 
frequency  responses,  and  the  corresponding  vocal  tract  cross-sectional  area  functions  for 
each  vowel.  The  diagram  descriptions  for  each  figure  are  illustrated  as  follows: 


Midsagittal vocal tract outline. Note that the approximate articulatory configuration is
obtained from the first four formants of the speech signal by using a simulated annealing
optimization algorithm. The floating point number indicates the error distance. See
Chapter 3.

Typical speech waveform containing three periods of the vowel.

Vocal tract cross-sectional area. Note that the x-axis represents the vocal tract length
(in cm) from the glottis to the lips, while the y-axis represents the cross-sectional area
in cm². This area function is obtained from the optimized midsagittal vocal tract outline.

Vocal tract frequency response. The x-axis represents frequency from 0 to 5000 Hz and the
y-axis is the magnitude in dB.

Figure A-2: Vowel /i/ as in beet.

Figure A-3: Vowel /æ/ as in bat.

Figure A-4: Vowel /ɛ/ as in bet.

Figure A-6: Vowel /ʌ/ as in but.

Figure A-8: Vowel /a/ as in father.

Figure A-9: Vowel /ʊ/ as in book.

Figure A-10: Vowel /u/ as in boot.

Figure A-12: Vowel /ou/ as in boat.


APPENDIX  B 

ACOUSTIC  TRANSFER  FUNCTION  CALCULATION 


The  formant  frequencies  are  required  by  the  simulated  annealing  algorithm  to 
calculate the articulatory-to-acoustic inverse transform. We need to extract the
formant frequencies from the acoustic transfer function or from the input impedance, both
of  which  are  calculated  from  the  vocal  tract  cross-sectional  area  function.  In  this 
appendix,  we  derive  the  acoustic  transfer  function  for  different  structures  using  a circuit 
network  representation  and  transmission  matrix  theory. 


In Chapter 2, section 2.3.1.3, we mentioned that an elemental tube with uniform
area A and length l can be analogous to the transmission line circuit model. Refer to the
circuit of Figure 2-10 and the corresponding component definitions in Table 2-2. The
circuit in Figure 2-10 can be rearranged as a four-terminal T-network, as shown in Figure
B-1. This network representation greatly facilitates the acoustic transfer function
calculation, which is defined as the ratio of the total output volume velocity from the
vocal tract and the nasal tract to the excitation volume velocity source input. Define the
propagation constant $\gamma$ as $\gamma = \sqrt{zy}$, where $z = R + j\omega L$,
$y = G + j\omega C + \frac{1}{Z_w}$, and $Z_w = R_w + j\omega L_w + \frac{1}{j\omega C_w}$.
Let the characteristic impedance be $Z_0 = \sqrt{z/y}$. Then we have the hyperbolic
elements $Z_a = Z_0 \tanh\left(\frac{\gamma}{2}\right)$ and $Z_b = \frac{Z_0}{\sinh\gamma}$.
See Fant (1960) and Flanagan (1972) for the details.

It  is  well  known  that  input-output  characteristics  of  a four-terminal  network  are 
described  by  a matrix  equation  of  the  form 

$$
\begin{bmatrix} P_i(\omega) \\ U_i(\omega) \end{bmatrix}
= [T(\omega)]
\begin{bmatrix} P_o(\omega) \\ U_o(\omega) \end{bmatrix}
\tag{B.1}
$$


where $P_i(\omega)$, $U_i(\omega)$ are the sound pressure and volume velocity, respectively,
at the input of the network; $P_o(\omega)$, $U_o(\omega)$ are the corresponding outputs of
the network; and $T(\omega)$ is the transmission matrix (also called ABCD matrix or chain
matrix) of the four-terminal network. From Ohm's law and the current loop law, the
transmission matrix of Figure B-1 is

Now consider two elemental sections, with a shunt impedance $Z_s$ in between
and with radiation impedance $Z_r$ as load. We represent the radiation impedance as
$Z_r = R_r + j\omega L_r$, where $R_r = \frac{128\rho c}{9\pi^2 A_m}$,
$L_r = \frac{8\rho}{3\pi\sqrt{\pi A_m}}$, and $A_m$ is the lip or nostril
opening area. The network representation is constructed in Figure B-2. Note that the
shunt impedance $Z_s$ can be considered as the modelling circuit for the nasal sinus (see
section 2.3.2). Assume that the transmission matrices of section i-1 and section i are
$T_{i-1}(\omega)$ and $T_i(\omega)$, respectively. Forming a dummy four-terminal T-network
as in Figure B-1 with $Z_a = 0$ and $Z_b = Z_s$ for the shunt element $Z_s$ and applying
equation (B.2), the transmission matrix is

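$$
T_s(\omega) =
\begin{bmatrix} A_s & B_s \\ C_s & D_s \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ \frac{1}{Z_s} & 1 \end{bmatrix}
\tag{B.3}
$$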


From the theory of cascade circuit networks, the overall transmission matrix for Figure
B-2 is the product of the individual transmission matrices for the sections. Thus the
relation between pressure and volume velocity at the input $P_{i-1}$, $U_{i-1}$ and at the
radiation $P_r$, $U_r$ can be written as

Figure B-1: Four-terminal T-network for a uniform elemental tube.

Figure B-2: Network representation of a two-section tube with a shunt element in between
and a radiation load.

From this matrix equation and Ohm's law, $P_r = Z_r U_r$, the transfer function
$H(\omega)$ from $U_{i-1}$ to $U_r$ is

$$
H(\omega) = \frac{U_r(\omega)}{U_{i-1}(\omega)} = \frac{1}{C_f Z_r + D_f}
\tag{B.5}
$$

and the input impedance $Z_{in}$ is

$$
Z_{in}(\omega) = \frac{P_{i-1}(\omega)}{U_{i-1}(\omega)} = \frac{A_f Z_r + B_f}{C_f Z_r + D_f}
\tag{B.6}
$$
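
The chain-matrix calculation can be sketched numerically as follows. The sketch uses the
standard ABCD matrix of a symmetric T-network (series arm, shunt arm, series arm), takes
the overall matrix as the product of the section and shunt matrices as described above,
and then evaluates the transfer function and the input impedance. Lossless rigid-wall
sections with lumped elements for a section of length l, the air constants, the section
dimensions, and the shunt impedance value are assumptions made for illustration.

```python
import numpy as np

RHO = 1.14e-3      # air density in g/cm^3 (assumed)
C_SOUND = 3.5e4    # speed of sound in cm/s (assumed)


def t_network_matrix(za, zb):
    """ABCD matrix of a symmetric T-network: series za, shunt zb, series za."""
    return np.array([[1 + za / zb, 2 * za + za * za / zb],
                     [1 / zb, 1 + za / zb]], dtype=complex)


def shunt_matrix(zs):
    """ABCD matrix of a pure shunt element (a T-network with za = 0, zb = zs)."""
    return np.array([[1, 0], [1 / zs, 1]], dtype=complex)


def lossless_section(area, length, freq):
    """T-network arms (Za, Zb) of a lossless, rigid-wall section; freq must be > 0."""
    w = 2 * np.pi * freq
    z = 1j * w * RHO * length / area                    # series impedance of the section
    y = 1j * w * area * length / (RHO * C_SOUND ** 2)   # shunt admittance of the section
    gamma = np.sqrt(z * y)
    z0 = np.sqrt(z / y)
    return z0 * np.tanh(gamma / 2), z0 / np.sinh(gamma)


def cascade(*matrices):
    """Overall transmission matrix: ordered product of the individual matrices."""
    total = np.eye(2, dtype=complex)
    for m in matrices:
        total = total @ m
    return total


def transfer_function(t, zr):
    """Volume-velocity transfer function 1 / (C Zr + D)."""
    return 1.0 / (t[1, 0] * zr + t[1, 1])


def input_impedance(t, zr):
    """Input impedance (A Zr + B) / (C Zr + D)."""
    return (t[0, 0] * zr + t[0, 1]) / (t[1, 0] * zr + t[1, 1])


freq = 500.0
t1 = t_network_matrix(*lossless_section(3.0, 8.75, freq))   # hypothetical section i-1
t2 = t_network_matrix(*lossless_section(6.0, 8.75, freq))   # hypothetical section i
ts = shunt_matrix(complex(1.0, 50.0))                        # hypothetical shunt impedance
tf = cascade(t1, ts, t2)
print(abs(transfer_function(tf, 0.0)), input_impedance(tf, 0.0))
```

Sweeping `freq` and locating the peaks of the transfer function magnitude is one way to
obtain the formant frequencies needed by the optimization procedure.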


By  use  of  the  four-terminal  network  representation  and  the  transmission  matrix, 
we  can  calculate  the  acoustic  transfer  function  of  the  vocal  tract  system  for  human  speech 
production.  Since  the  excitation  source  can  be  at  the  glottis  or  inside  the  vocal  tract,  there 
are  four  cases. 


Case A: Excitation source located at the glottis. Let $Z_{or}$ and $Z_{nr}$ be the
radiation impedances of the vocal tract and the nasal tract, respectively. Let $Z_{sub}$ be
the impedance of the glottis and the subglottal system looking backward from the glottis.
Use $A_pB_pC_pD_p$, $A_oB_oC_oD_o$, and $A_nB_nC_nD_n$ to represent the transmission
matrices of the pharyngeal tube, oral tract, and nasal tract, respectively. Figure B-3(a)
shows the equivalent network representation of the vocal system.


To calculate the acoustic transfer function $H(\omega)$ from $U_g$ to $U_r = U_{or} + U_{nr}$, we
first compute the input impedance of the oral tract and the input impedance of the nasal
tract, both of which are seen downstream from the bifurcation point. From equation
(B.6), we have the input impedance of the oral tract as

\[
Z_o = \frac{A_o Z_{or} + B_o}{C_o Z_{or} + D_o}
\tag{B.7}
\]

and the nasal tract input impedance as

\[
Z_n = \frac{A_n Z_{nr} + B_n}{C_n Z_{nr} + D_n}
\tag{B.8}
\]

Figure B-3: Case A with the excitation source located at the glottis.
(a) network representation of vocal system;
(b) network representation for calculating $H_v(\omega)$;
(c) network representation for calculating $H_n(\omega)$.


The second step is to calculate the acoustic transfer function $H_v(\omega)$ of the vocal
tract from $U_g$ to $U_{or}$. Forming an equivalent network for Figure B-3(a) (see Figure
B-3(b)) and applying the analysis results of Figure B-2 (refer to equation (B.5)), we have
$H_v(\omega)$ as

\[
H_v = \frac{Z_{sub}}{Z_p + Z_{sub}} \cdot \frac{1}{C_{vv} Z_{or} + D_{vv}}
\tag{B.9}
\]

and the corresponding transmission matrix as
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Similarly,  from  equation  (B.7),  the  input  impedance  of  the  vocal  tract  seen  downstream 
from  the  glottis  is 
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Applying  the  same  procedure  to  the  Figure  B-3(c),  the  acoustic  transfer  function  Hn(oo) 
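from $U_g$ to the output of the nasal tract $U_{nr}$ is

\[
H_n = \frac{Z_{sub}}{Z_p + Z_{sub}} \cdot \frac{1}{C_{nn} Z_{nr} + D_{nn}}
\tag{B.12}
\]

and the corresponding transmission matrix is

\[
\begin{bmatrix} A_{nn} & B_{nn} \\ C_{nn} & D_{nn} \end{bmatrix}
= \begin{bmatrix} A_p & B_p \\ C_p & D_p \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ 1/Z_o & 1 \end{bmatrix}
  \begin{bmatrix} A_n & B_n \\ C_n & D_n \end{bmatrix}
\tag{B.13}
\]

The vocal system acoustic transfer function is $H(\omega) = H_v(\omega) + H_n(\omega)$.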


Case B: Excitation source located in the pharyngeal tube. Let $A_{p1}B_{p1}C_{p1}D_{p1}$
represent the transmission matrix of the lower part of the pharynx, with the input at the
excitation location and the output at the glottis, as shown in Figure B-4. The results
obtained in case A can be applied to this case. Replace $Z_{sub}$ in equations (B.9) and
(B.12) with $Z'_{sub}$, where $Z'_{sub}$ is calculated as follows:

\[
Z'_{sub} = \frac{A_{p1} Z_{sub} + B_{p1}}{C_{p1} Z_{sub} + D_{p1}}
\tag{B.14}
\]

Then, the sum of equations (B.9) and (B.12) yields the final acoustic transfer function.

Figure B-4: Case B with the excitation source located in the pharyngeal tube.

Figure B-5: Case C with the excitation source located at the bifurcation point.


Case C: Excitation source located at the bifurcation of the vocal tract and the
nasal tract. As shown in Figure B-5, the network representation is the same as Figure
B-3 except that the excitation source is located at the bifurcation of the vocal tract and the
nasal tract. First, the input impedances $Z_o$ and $Z_n$ are calculated as in equations (B.7) and
(B.8). Then $H_v(\omega)$ and $H_n(\omega)$ are obtained as follows:

\[
H_v = \frac{Z'_{sub} Z_n}{Z'_{sub} Z_o + Z_o Z_n + Z_n Z'_{sub}} \cdot \frac{1}{C_o Z_{or} + D_o}
\tag{B.15}
\]

\[
H_n = \frac{Z'_{sub} Z_o}{Z'_{sub} Z_o + Z_o Z_n + Z_n Z'_{sub}} \cdot \frac{1}{C_n Z_{nr} + D_n}
\tag{B.16}
\]

where the impedance $Z'_{sub}$ is

\[
Z'_{sub} = \frac{A_p Z_{sub} + B_p}{C_p Z_{sub} + D_p}
\tag{B.17}
\]

The sum of $H_v(\omega)$ and $H_n(\omega)$ yields the acoustic transfer function for the vocal system.
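The leading fractions in equations (B.15) and (B.16) are the current-divider shares of three branches in parallel at the bifurcation (backward branch $Z'_{sub}$, oral branch $Z_o$, nasal branch $Z_n$). The sketch below evaluates Case C and cross-checks the shares against a plain admittance division. The function and argument names are hypothetical, chosen only for this illustration.

```python
def case_c_transfer(z_sub_p, z_o, z_n, oral_cd, nasal_cd, z_or, z_nr):
    """Evaluate (B.15) and (B.16); returns (H_v, H_n, H_v + H_n).

    oral_cd and nasal_cd are the (C, D) entries of the oral and nasal
    transmission matrices; z_sub_p stands for Z'_sub of equation (B.17).
    """
    c_o, d_o = oral_cd
    c_n, d_n = nasal_cd
    denom = z_sub_p * z_o + z_o * z_n + z_n * z_sub_p
    h_v = (z_sub_p * z_n / denom) / (c_o * z_or + d_o)
    h_n = (z_sub_p * z_o / denom) / (c_n * z_nr + d_n)
    return h_v, h_n, h_v + h_n

def branch_shares(z_sub_p, z_o, z_n):
    """Fraction of the source volume velocity entering each parallel branch."""
    admittances = [1.0 / z_sub_p, 1.0 / z_o, 1.0 / z_n]
    total = sum(admittances)
    return [y / total for y in admittances]
```

With trivial tracts (C = 0, D = 1), `case_c_transfer` returns exactly the oral and nasal entries of `branch_shares`, which is a quick consistency check on the reconstructed fractions.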


Case D: Excitation source located in the oral tract. As shown in Figure B-6(a),
we use $A_{ob}B_{ob}C_{ob}D_{ob}$ and $A_{of}B_{of}C_{of}D_{of}$ to represent the transmission matrices of the back
and front parts of the oral tract, respectively. From equations (B.7), (B.8), and (B.17), we
calculate the impedances $Z_o$, $Z_n$, and $Z'_{sub}$ as follows:

\[
Z_o = \frac{A_{of} Z_{or} + B_{of}}{C_{of} Z_{or} + D_{of}}
\tag{B.18}
\]

\[
Z_n = \frac{A_n Z_{nr} + B_n}{C_n Z_{nr} + D_n}
\tag{B.19}
\]

\[
Z'_{sub} = \frac{A_p Z_{sub} + B_p}{C_p Z_{sub} + D_p}
\tag{B.20}
\]

If we define $Z_{ns}$ (see Figure B-6(b)) as

\[
Z_{ns} = \frac{Z_n Z'_{sub}}{Z_n + Z'_{sub}}
\tag{B.21}
\]

then the impedance $Z_p$ can be calculated as

\[
Z_p = \frac{A_{ob} Z_{ns} + B_{ob}}{C_{ob} Z_{ns} + D_{ob}}
\tag{B.22}
\]

Figure B-6: Case D with the excitation source located in the oral tract.
(a) network representation of vocal system;
(b) network representation for calculating $H_v(\omega)$;
(c) network representation for calculating $H_n(\omega)$.


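Now, apply equation (B.9) to obtain $H_v(\omega)$ as follows:

\[
H_v = \frac{Z_p}{Z_p + Z_o} \cdot \frac{1}{C_{of} Z_{or} + D_{of}}
\tag{B.23}
\]

For $H_n(\omega)$, we form the network as shown in Figure B-6(c) and apply equation
(B.12) to yield

\[
H_n = \frac{Z_o}{Z_p + Z_o} \cdot \frac{1}{C_{on} Z_{nr} + D_{on}}
\tag{B.24}
\]

where the corresponding transmission matrix is

\[
\begin{bmatrix} A_{on} & B_{on} \\ C_{on} & D_{on} \end{bmatrix}
= \begin{bmatrix} A_{ob} & B_{ob} \\ C_{ob} & D_{ob} \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ 1/Z'_{sub} & 1 \end{bmatrix}
  \begin{bmatrix} A_n & B_n \\ C_n & D_n \end{bmatrix}
\tag{B.25}
\]

Then, the final acoustic transfer function is $H_v(\omega) + H_n(\omega)$.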


We have considered the acoustic transfer function of the soft-wall, lossy vocal
system. For some cases, calculation of the acoustic transfer function of a lossless or lossy,
hard-wall vocal system is adequate and also easier. For a rigid and lossy vocal system, the
wall impedance $Z_w$ of each section (see Figure 2-10) is removed, causing the impedance
$Z_w$ to be infinity. In this case, the admittance $y$ becomes $y = G + j\omega C$. For a rigid and
lossless vocal system, no dissipation elements and no wall impedance are
present for each section. Then the propagation constant $\gamma$ becomes an imaginary quantity.

In the above derivations, we assume that the impedance $Z_{sub}$ is composed of the
Foster-chain circuit model for the subglottal system with glottal impedance
$Z_g = R_g + j\omega L_g$, where $R_g$ and $L_g$ are defined in section 2.3.5. Since the glottal area is
a time-varying function, we assume that the glottal area and glottal volume velocity are
constants when we evaluate the acoustic transfer function. In case A, if the subglottal
system and the glottal impedance are not included, then we let the impedance $Z_{sub} \to \infty$,
which corresponds to an open circuit, i.e., the block representing the subglottal system
and the glottal impedance in Figure B-3 is disconnected. For other cases, two glottal
conditions are considered. If the glottis is open, the effects of the subglottal system and
the glottal impedance are evaluated. Otherwise, if the glottis is closed, we let $Z_{sub} = 0$,
i.e., $Z_{sub}$ is a short circuit. The velopharyngeal port opening area, which couples the nasal
tract to the vocal system, also affects the acoustic transfer function. Our software system
allows the user to change the velopharyngeal port opening area, the number of coupled
sinus cavities, and the location of the excitation source when evaluating the acoustic
transfer function.


APPENDIX  C 

DERIVATION  OF  DISCRETE-TIME  ACOUSTIC  EQUATIONS 


As  covered  in  Chapter  2,  section  2.3.1,  the  vocal  tract  tube  can  be  described  by 
two  coupled  partial  differential  acoustic  equations.  These  two  acoustic  equations  are 
functions  of  both  time  and  space.  Approximating  the  vocal  tract  as  a sequence  of 
elemental  sections  corresponds  to  digitizing  the  vocal  tract  in  space,  i.e.,  spatial  sampling. 
For  each  elemental  section,  the  transmission-line  analog  approach  is  applied  to  form  the 
equivalent  circuit  model,  as  seen  in  Figure  2-10.  Connecting  the  equivalent  circuit  of 
each  section  together  in  combination  with  the  equivalent  circuit  models  of  the  other  parts 
of  the  vocal  system  (subglottal  system,  glottis,  and  nasal  sinus  cavities),  a lumped  circuit 
network  representation  of  the  vocal  system  can  be  formed,  as  shown  in  Figure  2-20.  For 
the time-domain approach, Kirchhoff's and Ohm's laws are applied to the circuit
network  to  obtain  sets  of  differential  equations.  These  differential  equations,  which 
correspond  to  the  equivalent  acoustic  equations  that  govern  the  generation  and  the 
propagation  of  acoustic  waves  inside  the  vocal  system,  are  transformed  into  discrete-time 
representations.  This  appendix  provides  a detailed  derivation  of  the  discrete-time  acoustic 
equations,  i.e.,  the  difference  matrix  equations.  The  discretization  scheme  is  similar  to  the 
work  of  Maeda  (1982a).  Our  model,  however,  provides  more  features,  such  as  the 
subglottal  system,  nasal  sinus  cavities,  and  turbulence  noise  source. 

Consider the transmission-line circuit model, shown in Figure C-1, of the i-th
section of the vocal tract. Define the volume velocity (current) at the input of the i-th
section, i.e., at $x_{i-1}$ in space, as $U_i(t) = u_i(x_{i-1}, t)$, where $x_{i-1} = \sum_{k=1}^{i-1} l_k$ is the vocal
tract length from the glottis to the i-th section. Similarly, define the central pressure
(voltage) of the i-th section as $P_i(t) = p_i(x_i - l_i/2, t)$, where $l_i$ is the section length of
section i.

Figure C-1: A lumped transmission-line circuit model of the i-th section.

Figure C-2: A shunt element of the nasal sinus inserted between two sections.

From Kirchhoff's and Ohm's laws, we have the following differential equations:

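\[
\begin{aligned}
P_{i-1}(t) - P_i(t) &= \frac{d}{dt}\Big\{[L_{i-1}(t) + L_i(t)]\,U_i(t)\Big\} + [R_{i-1}(t) + R_i(t)]\,U_i(t) \\
U_i(t) - U_{i+1}(t) &= u_{i1}(t) + u_{i2}(t) + u_{i3}(t) \\
                    &= G_i(t)P_i(t) + \frac{d}{dt}\big[C_i(t)P_i(t)\big] + u_{i3}(t) \\
P_i(t) &= R_{w,i}(t)\,u_{i3}(t) + \frac{d}{dt}\big[L_{w,i}(t)\,u_{i3}(t)\big] + \int_0^t \frac{u_{i3}(\tau)}{C_{w,i}(\tau)}\,d\tau
\end{aligned}
\tag{C.1}
\]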


There are three kinds of terms in equation (C.1): simple, differential, and integral terms. Define the
general forms of these three terms as follows:

\[
\begin{aligned}
y_1(t) &= c_1(t)\,x(t) \\
y_2(t) &= \frac{d}{dt}\big[c_2(t)\,x(t)\big] \\
y_3(t) &= \int_0^t c_3(\tau)\,x(\tau)\,d\tau
\end{aligned}
\tag{C.2}
\]

where $c_i(t)$ (i = 1, 2, or 3) is a coefficient that represents the time-varying circuit
component, and $x(t)$ can be $P_i(t)$, $U_i(t)$, or $u_{i3}(t)$. Let $y_i(n) = y_i(t = nT)$,
$c_i(n) = c_i(t = nT)$, and $x(n) = x(t = nT)$ represent the sampled values of $y_i(t)$, $c_i(t)$,
and $x(t)$, respectively, at $t = nT$, where n = 0, 1, 2, ..., and T denotes the sampling time
interval. For the first term, it is obvious that

\[
y_1(n) = c_1(n)\,x(n)
\tag{C.3}
\]

For the differential term, we first integrate both sides

\[
\int_{(n-1)T}^{nT} y_2(t)\,dt = \int_{(n-1)T}^{nT} \frac{d}{dt}\big[c_2(t)x(t)\big]\,dt = \int_{(n-1)T}^{nT} d\big\{c_2(t)x(t)\big\}
\tag{C.4}
\]

The trapezoidal rule is applied to the left-hand side of equation (C.4) to yield

\[
\int_{(n-1)T}^{nT} y_2(t)\,dt = \frac{T}{2}\big[y_2(n) + y_2(n-1)\big]
\tag{C.5}
\]

Then the discrete-time form of equation (C.4) is

\[
\frac{T}{2}\big[y_2(n) + y_2(n-1)\big] = c_2(n)x(n) - c_2(n-1)x(n-1)
\tag{C.6}
\]

Equation (C.6) can be rewritten in recursive form as

\[
y_2(n) = \frac{2}{T}c_2(n)x(n) - Q(n-1),
\quad \text{where } Q(n-1) = \frac{4}{T}c_2(n-1)x(n-1) - Q(n-2)
\tag{C.7}
\]

Similarly, the integral term is approximated by

\[
y_3(n) = \frac{T}{2}c_3(n)x(n) + V(n-1),
\quad \text{where } V(n-1) = T\,c_3(n-1)x(n-1) + V(n-2)
\tag{C.8}
\]

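Now applying the rules of equations (C.3), (C.7), and (C.8) to the set of
differential equations (C.1), the transformed equations are

\[
\begin{aligned}
P_{i-1}(n) - P_i(n) &= \Big\{R_{i-1}(n) + R_i(n) + \frac{2}{T}\big[L_{i-1}(n) + L_i(n)\big]\Big\}U_i(n) - Q_{L_{i-1}L_i}(n-1) \\
U_i(n) - U_{i+1}(n) &= \Big[G_i(n) + \frac{2}{T}C_i(n)\Big]P_i(n) - Q_{C_i}(n-1) + u_{i3}(n) \\
P_i(n) &= \Big[R_{w,i}(n) + \frac{2}{T}L_{w,i}(n) + \frac{T}{2C_{w,i}(n)}\Big]u_{i3}(n) + V_{C_{w,i}}(n-1) - Q_{L_{w,i}}(n-1)
\end{aligned}
\tag{C.9}
\]

where

\[
\begin{aligned}
Q_{L_{i-1}L_i}(n-1) &= \frac{4}{T}\big[L_{i-1}(n-1) + L_i(n-1)\big]U_i(n-1) - Q_{L_{i-1}L_i}(n-2) \\
Q_{C_i}(n-1) &= \frac{4}{T}C_i(n-1)P_i(n-1) - Q_{C_i}(n-2) \\
V_{C_{w,i}}(n-1) &= \frac{T\,u_{i3}(n-1)}{C_{w,i}(n-1)} + V_{C_{w,i}}(n-2) \\
Q_{L_{w,i}}(n-1) &= \frac{4}{T}L_{w,i}(n-1)\,u_{i3}(n-1) - Q_{L_{w,i}}(n-2)
\end{aligned}
\]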
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Define 


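\[
\begin{aligned}
a_i(n) &= \frac{2}{T}L_i(n) + R_i(n) \\
Y_{w,i}(n) &= \frac{1}{\frac{2}{T}L_{w,i}(n) + R_{w,i}(n) + \frac{T}{2C_{w,i}(n)}} \\
b_i(n) &= \frac{1}{\frac{2}{T}C_i(n) + G_i(n) + Y_{w,i}(n)} \\
V_i(n-1) &= Q_{C_i}(n-1) - Y_{w,i}(n)\big[Q_{L_{w,i}}(n-1) - V_{C_{w,i}}(n-1)\big] \\
u_{i3}(n) &= Y_{w,i}(n)\big[P_i(n) + Q_{L_{w,i}}(n-1) - V_{C_{w,i}}(n-1)\big] \\
H_i(n) &= a_{i-1}(n) + a_i(n) + b_{i-1}(n) + b_i(n) \\
F_i(n) &= b_{i-1}(n)V_{i-1}(n-1) + Q_{L_{i-1}L_i}(n-1) - b_i(n)V_i(n-1)
\end{aligned}
\tag{C.10}
\]

After some manipulations, we obtain the difference equations for the i-th section, as
follows:

\[
\begin{aligned}
F_i(n) &= -b_{i-1}(n)U_{i-1}(n) + H_i(n)U_i(n) - b_i(n)U_{i+1}(n) \\
P_i(n) &= b_i(n)\big[U_i(n) - U_{i+1}(n) + V_i(n-1)\big]
\end{aligned}
\tag{C.11}
\]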
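The manipulations leading to (C.11) can be spelled out as a short worked step, using only the transformed equations (C.9) and the definitions (C.10). The wall-branch current is eliminated first, which gives the pressure equation; that pressure is then inserted for sections i-1 and i into the first equation of (C.9).

```latex
% Step 1: solve the third equation of (C.9) for u_{i3}(n)
u_{i3}(n) = Y_{w,i}(n)\,\bigl[P_i(n) + Q_{L_{w,i}}(n-1) - V_{C_{w,i}}(n-1)\bigr]

% Step 2: substitute into the second equation of (C.9)
U_i(n) - U_{i+1}(n) = \frac{P_i(n)}{b_i(n)} - V_i(n-1)
\;\Longrightarrow\;
P_i(n) = b_i(n)\,\bigl[U_i(n) - U_{i+1}(n) + V_i(n-1)\bigr]

% Step 3: insert P_{i-1}(n) and P_i(n) into the first equation of (C.9)
b_{i-1}(n)\bigl[U_{i-1}(n) - U_i(n) + V_{i-1}(n-1)\bigr]
 - b_i(n)\bigl[U_i(n) - U_{i+1}(n) + V_i(n-1)\bigr]
 = \bigl[a_{i-1}(n) + a_i(n)\bigr]U_i(n) - Q_{L_{i-1}L_i}(n-1)

% Step 4: move the known (past) terms to the left to obtain F_i(n)
F_i(n) = -b_{i-1}(n)U_{i-1}(n) + H_i(n)U_i(n) - b_i(n)U_{i+1}(n)
```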


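Next, consider a shunt element, which models the nasal sinus, inserted between
two sections, i and i+1. Figure C-2 is the corresponding circuit model. The set of
differential equations is given by

\[
\begin{aligned}
P_i(t) - P_{sin}(t) &= \frac{d}{dt}\big[L_i(t)U'_i(t)\big] + R_i(t)U'_i(t) \\
P_{sin}(t) - P_{i+1}(t) &= \frac{d}{dt}\big[L_{i+1}(t)U_{i+1}(t)\big] + R_{i+1}(t)U_{i+1}(t) \\
P_{sin}(t) &= R_{sin}U_{sin}(t) + L_{sin}\frac{d}{dt}U_{sin}(t) + \frac{1}{C_{sin}}\int_0^t U_{sin}(\tau)\,d\tau
\end{aligned}
\tag{C.12}
\]

Applying the rules of equations (C.3), (C.7), and (C.8) to equation set (C.12), the
corresponding difference equations are

\[
\begin{aligned}
P_i(n) - P_{sin}(n) &= a_i(n)U'_i(n) - Q_{L_i}(n-1) \\
P_{sin}(n) - P_{i+1}(n) &= a_{i+1}(n)U_{i+1}(n) - Q_{L_{i+1}}(n-1) \\
P_{sin}(n) &= b_{sin}\big[U'_i(n) - U_{i+1}(n)\big] - Q_{L_{sin}}(n-1) + V_{C_{sin}}(n-1)
\end{aligned}
\tag{C.13}
\]

where

\[
\begin{aligned}
a_i(n) &= \frac{2}{T}L_i(n) + R_i(n) \\
U_{sin}(n) &= U'_i(n) - U_{i+1}(n) \\
Q_{L_i}(n-1) &= \frac{4}{T}L_i(n-1)U'_i(n-1) - Q_{L_i}(n-2) \\
Q_{L_{i+1}}(n-1) &= \frac{4}{T}L_{i+1}(n-1)U_{i+1}(n-1) - Q_{L_{i+1}}(n-2) \\
Q_{L_{sin}}(n-1) &= \frac{4}{T}L_{sin}U_{sin}(n-1) - Q_{L_{sin}}(n-2) \\
V_{C_{sin}}(n-1) &= \frac{T}{C_{sin}}U_{sin}(n-1) + V_{C_{sin}}(n-2)
\end{aligned}
\]

After rearranging terms, the difference equations governing the shunt element can be
rewritten as

\[
\begin{aligned}
F_{sin}(n) &= -b_i(n)U_i(n) + H_{sf}(n)U'_i(n) - b_{sin}U_{i+1}(n) \\
F_{i+1}(n) &= -b_{sin}U'_i(n) + H_{sb}(n)U_{i+1}(n) - b_{i+1}(n)U_{i+2}(n) \\
P_{sin}(n) &= b_{sin}\big[U'_i(n) - U_{i+1}(n)\big] - Q_{L_{sin}}(n-1) + V_{C_{sin}}(n-1) \\
P_i(n) &= b_i(n)\big[U_i(n) - U'_i(n) + V_i(n-1)\big]
\end{aligned}
\tag{C.14}
\]

where

\[
\begin{aligned}
F_{sin}(n) &= Q_{L_{sin}}(n-1) + Q_{L_i}(n-1) + b_i(n)V_i(n-1) - V_{C_{sin}}(n-1) \\
F_{i+1}(n) &= Q_{L_{i+1}}(n-1) - Q_{L_{sin}}(n-1) - b_{i+1}(n)V_{i+1}(n-1) + V_{C_{sin}}(n-1) \\
H_{sf}(n) &= a_i(n) + b_i(n) + b_{sin} \\
H_{sb}(n) &= a_{i+1}(n) + b_{i+1}(n) + b_{sin}
\end{aligned}
\]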
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Now,  we  consider  the  circuit  representation  at  the  bifurcation  point  of  the  vocal 
tract  and  the  nasal  tract.  From  Figure  C-3,  the  differential  equations  are  given  by 

\[
\begin{aligned}
P_i(t) - P_{NC}(t) &= \frac{d}{dt}\big[L_i(t)U_{NC}(t)\big] + R_i(t)U_{NC}(t) \\
P_{NC}(t) - P_{i+1}(t) &= \frac{d}{dt}\big[L_{i+1}(t)U_{i+1}(t)\big] + R_{i+1}(t)U_{i+1}(t) \\
P_{NC}(t) - P_{N1}(t) &= \frac{d}{dt}\big[L_{N1}(t)U_{N1}(t)\big] + R_{N1}(t)U_{N1}(t) \\
U_i(t) - U_{NC}(t) &= u_{i1}(t) + u_{i2}(t) + u_{i3}(t) \\
                   &= G_i(t)P_i(t) + \frac{d}{dt}\big[C_i(t)P_i(t)\big] + u_{i3}(t) \\
U_{NC}(t) &= U_{i+1}(t) + U_{N1}(t)
\end{aligned}
\tag{C.15}
\]

After applying the discretization rules, previous derivation results, and some simple
manipulations, one obtains the following difference equations

\[
\begin{aligned}
F_i(n) &= -b_{i-1}(n)U_{i-1}(n) + H_i(n)U_i(n) - b_i(n)U_{NC}(n) \\
F_{NC}(n) &= -b_i(n)U_i(n) + H_{NC}(n)U_{NC}(n) + P_{NC}(n) \\
F_{i+1}(n) &= -P_{NC}(n) + H_{i+1}(n)U_{i+1}(n) - b_{i+1}(n)U_{i+2}(n) \\
P_i(n) &= b_i(n)\big[U_i(n) - U_{NC}(n) + V_i(n-1)\big] \\
F_{N1}(n) &= -P_{NC}(n) + H_{N1}(n)U_{N1}(n) - b_{N1}(n)U_{N2}(n) \\
U_{NC}(n) &= U_{i+1}(n) + U_{N1}(n)
\end{aligned}
\tag{C.16}
\]

where

\[
\begin{aligned}
F_i(n) &= Q_{L_{i-1}L_i}(n-1) + b_{i-1}(n)V_{i-1}(n-1) - b_i(n)V_i(n-1) \\
F_{NC}(n) &= Q_{L_i}(n-1) + b_i(n)V_i(n-1) \\
F_{i+1}(n) &= Q_{L_{i+1}}(n-1) - b_{i+1}(n)V_{i+1}(n-1) \\
F_{N1}(n) &= Q_{L_{N1}}(n-1) - b_{N1}(n)V_{N1}(n-1) \\
H_{NC}(n) &= a_i(n) + b_i(n) \\
H_{i+1}(n) &= a_{i+1}(n) + b_{i+1}(n) \\
H_{N1}(n) &= a_{N1}(n) + b_{N1}(n) \\
Q_{L_i}(n-1) &= \frac{4}{T}L_i(n-1)U_{NC}(n-1) - Q_{L_i}(n-2)
\end{aligned}
\]

Figure C-3: The circuit model of the bifurcation point.

Figure C-4: The circuit model of radiation.

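Next, we consider the radiation part. Figure C-4 shows the circuit model of the
mouth section and radiation. The differential equations for this circuit are

\[
\begin{aligned}
P_i(t) - P_r(t) &= \frac{d}{dt}\big[L_i(t)U_r(t)\big] + R_i(t)U_r(t) \\
U_r(t) &= \frac{P_r(t)}{R_r(t)} + \int_0^t \frac{P_r(\tau)}{L_r(\tau)}\,d\tau
\end{aligned}
\tag{C.17}
\]

The corresponding difference equations are

\[
\begin{aligned}
F_i(n) &= -b_{i-1}(n)U_{i-1}(n) + H_i(n)U_i(n) - b_i(n)U_r(n) \\
F_r(n) &= -b_i(n)U_i(n) + H_r(n)U_r(n) \\
P_i(n) &= b_i(n)\big[U_i(n) - U_r(n) + V_i(n-1)\big] \\
P_r(n) &= b_r(n)\big[U_r(n) - V_{L_r}(n-1)\big]
\end{aligned}
\tag{C.18}
\]

where

\[
\begin{aligned}
F_i(n) &= Q_{L_{i-1}L_i}(n-1) + b_{i-1}(n)V_{i-1}(n-1) - b_i(n)V_i(n-1) \\
F_r(n) &= Q_{L_i}(n-1) + b_i(n)V_i(n-1) + b_r(n)V_{L_r}(n-1) \\
H_r(n) &= a_i(n) + b_i(n) + b_r(n) \\
b_r(n) &= \frac{1}{\frac{1}{R_r(n)} + \frac{T}{2L_r(n)}} \\
Q_{L_i}(n-1) &= \frac{4}{T}L_i(n-1)U_r(n-1) - Q_{L_i}(n-2) \\
V_{L_r}(n-1) &= \frac{T}{L_r(n-1)}P_r(n-1) + V_{L_r}(n-2)
\end{aligned}
\]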


Finally, we consider the excitation input. We have two types: one without the
glottal impedance and the subglottal system, and one with the glottal impedance and the
subglottal system. We first derive the difference equations for no glottal impedance and
no subglottal system. From Figure C-5, it is easy to obtain the difference equations from
previous derivation results. They are

\[
\begin{aligned}
F_1(n) &= -P_g(n) - b_1(n)U_2(n) \\
U_1(n) &= U_g(n) \\
P_1(n) &= b_1(n)\big[U_1(n) - U_2(n) + V_1(n-1)\big]
\end{aligned}
\tag{C.19}
\]

where $F_1(n) = Q_{L_1}(n-1) - b_1(n)V_1(n-1) - [a_1(n) + b_1(n)]U_1(n)$.

For the case of excitation source with the glottal impedance and the subglottal
system, we form the circuit model in Figure C-6. The corresponding differential
equations are given by

\[
\begin{aligned}
U_0(t) &= \frac{P_{s1}(t) - P_{s2}(t)}{R_{s1}} + \frac{1}{L_{s1}}\int_0^t \big[P_{s1}(\tau) - P_{s2}(\tau)\big]d\tau + C_{s1}\frac{d}{dt}\big[P_{s1}(t) - P_{s2}(t)\big] \\
       &= \frac{P_{s2}(t) - P_{s3}(t)}{R_{s2}} + \frac{1}{L_{s2}}\int_0^t \big[P_{s2}(\tau) - P_{s3}(\tau)\big]d\tau + C_{s2}\frac{d}{dt}\big[P_{s2}(t) - P_{s3}(t)\big] \\
       &= \frac{P_{s3}(t)}{R_{s3}} + \frac{1}{L_{s3}}\int_0^t P_{s3}(\tau)\,d\tau + C_{s3}\frac{dP_{s3}(t)}{dt} \\
P_g(t) - P_{s1}(t) &= R_g(t)U_0(t) + \frac{d}{dt}\big[L_g(t)U_0(t)\big] \\
U_g(t) &= U_0(t) + U_1(t)
\end{aligned}
\tag{C.20}
\]

Applying the discretization rules to the above equations and solving for $P_{si}(n)$, i = 1, 2, 3, in order,
the difference equations can be written as

\[
\begin{aligned}
F_s(n) &= P_g(n) + X_{sub}(n)U_1(n) \\
U_g(n) &= U_1(n) + U_0(n) \\
F_1(n) &= -P_g(n) + H_1(n)U_1(n) - b_1(n)U_2(n)
\end{aligned}
\tag{C.21}
\]

where

\[
\begin{aligned}
F_s(n) &= V_{s1}(n-1) + V_{s2}(n-1) + V_{s3}(n-1) + X_{sub}(n)U_g(n) - Q_{L_g}(n-1) \\
F_1(n) &= Q_{L_1}(n-1) - b_1(n)V_1(n-1) \\
V_{si}(n-1) &= b_{si}(n)\Big[C_{si}\,Q_{C_{si}}(n-1) - \frac{V_{L_{si}}(n-1)}{L_{si}}\Big], \quad i = 1, 2, 3 \\
b_{si}(n) &= \frac{1}{\frac{1}{R_{si}(n)} + \frac{T}{2L_{si}(n)} + \frac{2}{T}C_{si}}, \quad i = 1, 2, 3 \\
Q_{C_{si}}(n-1) &= \frac{4}{T}\big[P_{si}(n-1) - P_{s,i+1}(n-1)\big] - Q_{C_{si}}(n-2), \quad i = 1, 2 \\
Q_{C_{s3}}(n-1) &= \frac{4}{T}P_{s3}(n-1) - Q_{C_{s3}}(n-2) \\
V_{L_{si}}(n-1) &= T\big[P_{si}(n-1) - P_{s,i+1}(n-1)\big] + V_{L_{si}}(n-2), \quad i = 1, 2 \\
V_{L_{s3}}(n-1) &= T\,P_{s3}(n-1) + V_{L_{s3}}(n-2) \\
P_{si}(n) &= \sum_{k=i}^{3} b_{sk}(n)U_0(n) + \sum_{k=i}^{3} V_{sk}(n-1), \quad i = 1, 2, 3 \\
P_g(n) &= V_{s1}(n-1) + V_{s2}(n-1) + V_{s3}(n-1) + X_{sub}(n)U_0(n) - Q_{L_g}(n-1) \\
Q_{L_g}(n-1) &= \frac{4}{T}L_g(n-1)U_0(n-1) - Q_{L_g}(n-2) \\
X_{sub}(n) &= a_g(n) + b_{s1} + b_{s2} + b_{s3} \\
a_g(n) &= R_g(n) + \frac{2}{T}L_g(n)
\end{aligned}
\]

Figure C-5: The circuit model of the first section of vocal tract with excitation.

Figure C-6: The circuit model of subglottal system and glottal impedance
with excitation.

We have derived the difference equations for different parts of the vocal system.
Note that in order to set all recursive equations, Q(n-1) and V(n-1), in motion, the initial
rest conditions of the vocal system need to be assumed, i.e., Q(0) = 0 and V(0) = 0 for all
sections. To write out the matrix equations for the entire vocal system, we assume that

[1] the vocal tract has a number of sections denoted as NING,

[2] the nasal tract has a number of sections denoted as NTN,

[3] the bifurcation point is located downstream of a vocal-tract section
denoted as NTS,

[4] the excitation source is located at the input to the first section of the vocal tract,

[5] one sinus is inserted downstream of a nasal-tract section denoted as NS.

For the vocal system which has no glottal impedance and no subglottal system but has one
nasal sinus coupling, the three sets of simultaneous difference equations or the matrix
equations for the pharyngeal, oral, and nasal tracts, respectively, are

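Pharyngeal tract, with $U_1(n) = U_g(n)$:

\[
\begin{aligned}
k = 0:\quad F_1(n) &= Q_{L_1}(n-1) - b_1(n)V_1(n-1) - H_1(n)U_1(n) \\
                   &= -P_g(n) - b_1(n)U_2(n) \\
k = 1:\quad F_2(n) &= b_1(n)V_1(n-1) + Q_{L_1L_2}(n-1) - b_2(n)V_2(n-1) + b_1(n)U_1(n) \\
                   &= H_2(n)U_2(n) - b_2(n)U_3(n) \\
2 \le k < \mathrm{NTS}:\quad F_{k+1}(n) &= b_k(n)V_k(n-1) + Q_{L_kL_{k+1}}(n-1) - b_{k+1}(n)V_{k+1}(n-1) \\
                   &= -b_k(n)U_k(n) + H_{k+1}(n)U_{k+1}(n) - b_{k+1}(n)U_{k+2}(n), \\
                   &\qquad \text{if } k + 2 = \mathrm{NTS} + 1 \Rightarrow U_{k+2}(n) = U_{NC}(n) \\
k = \mathrm{NTS}:\quad F_{NC}(n) &= Q_{L_{\mathrm{NTS}}}(n-1) + b_{\mathrm{NTS}}(n)V_{\mathrm{NTS}}(n-1) \\
                   &= -b_{\mathrm{NTS}}(n)U_{\mathrm{NTS}}(n) + H_{NC}(n)U_{NC}(n) + P_{NC}(n) \\
P_i(n) &= b_i(n)\big[U_i(n) - U_{i+1}(n) + V_i(n-1)\big], \quad i = 1, \ldots, \mathrm{NTS}; \\
       &\qquad \text{if } i = \mathrm{NTS}, \text{ then } U_{\mathrm{NTS}+1}(n) = U_{NC}(n)
\end{aligned}
\]

\[
\begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_{\mathrm{NTS}} \\ F_{NC} \end{bmatrix}
=
\begin{bmatrix}
-1 & -b_1 & 0 & \cdots & 0 & 0 & 0 \\
0 & H_2 & -b_2 & \cdots & 0 & 0 & 0 \\
\vdots & & & \ddots & & & \vdots \\
0 & 0 & 0 & \cdots & H_{\mathrm{NTS}} & -b_{\mathrm{NTS}} & 0 \\
0 & 0 & 0 & \cdots & -b_{\mathrm{NTS}} & H_{NC} & 1
\end{bmatrix}
\begin{bmatrix} P_g \\ U_2 \\ \vdots \\ U_{NC} \\ P_{NC} \end{bmatrix}
\tag{C.22}
\]

Oral tract:

\[
\begin{aligned}
F_{\mathrm{NTS}+1}(n) &= Q_{L_{\mathrm{NTS}+1}}(n-1) - b_{\mathrm{NTS}+1}(n)V_{\mathrm{NTS}+1}(n-1) \\
                      &= -P_{NC}(n) + H_{\mathrm{NTS}+1}(n)U_{\mathrm{NTS}+1}(n) - b_{\mathrm{NTS}+1}(n)U_{\mathrm{NTS}+2}(n) \\
F_i(n) &= Q_{L_{i-1}L_i}(n-1) + b_{i-1}(n)V_{i-1}(n-1) - b_i(n)V_i(n-1) \\
       &= -b_{i-1}(n)U_{i-1}(n) + H_i(n)U_i(n) - b_i(n)U_{i+1}(n), \\
       &\qquad i = \mathrm{NTS}+2, \ldots, \mathrm{NING}; \text{ if } i = \mathrm{NING}, \text{ then } U_{\mathrm{NING}+1}(n) = U_r(n) \\
F_r(n) &= b_{\mathrm{NING}}(n)V_{\mathrm{NING}}(n-1) + b_r(n)V_{L_r}(n-1) + Q_{L_{\mathrm{NING}}}(n-1) \\
       &= -b_{\mathrm{NING}}(n)U_{\mathrm{NING}}(n) + H_r(n)U_r(n) \\
P_i(n) &= b_i(n)\big[U_i(n) - U_{i+1}(n) + V_i(n-1)\big], \quad i = \mathrm{NTS}+1, \ldots, \mathrm{NING}-1 \\
P_{\mathrm{NING}}(n) &= b_{\mathrm{NING}}(n)\big[U_{\mathrm{NING}}(n) - U_r(n) + V_{\mathrm{NING}}(n-1)\big] \\
P_r(n) &= b_r(n)\big[U_r(n) - V_{L_r}(n-1)\big]
\end{aligned}
\]

\[
\begin{bmatrix} F_{\mathrm{NTS}+1} \\ F_{\mathrm{NTS}+2} \\ F_{\mathrm{NTS}+3} \\ \vdots \\ F_{\mathrm{NING}} \\ F_r \end{bmatrix}
=
\begin{bmatrix}
-1 & H_{\mathrm{NTS}+1} & -b_{\mathrm{NTS}+1} & 0 & \cdots & 0 & 0 & 0 \\
0 & -b_{\mathrm{NTS}+1} & H_{\mathrm{NTS}+2} & -b_{\mathrm{NTS}+2} & \cdots & 0 & 0 & 0 \\
0 & 0 & -b_{\mathrm{NTS}+2} & H_{\mathrm{NTS}+3} & \cdots & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & -b_{\mathrm{NING}-1} & H_{\mathrm{NING}} & -b_{\mathrm{NING}} \\
0 & 0 & 0 & 0 & \cdots & 0 & -b_{\mathrm{NING}} & H_r
\end{bmatrix}
\begin{bmatrix} P_{NC} \\ U_{\mathrm{NTS}+1} \\ U_{\mathrm{NTS}+2} \\ \vdots \\ U_{\mathrm{NING}} \\ U_r \end{bmatrix}
\tag{C.23}
\]

Nasal tract with one sinus:

\[
\begin{aligned}
F_{N1}(n) &= Q_{L_{N1}}(n-1) - b_{N1}(n)V_{N1}(n-1) \\
          &= -P_{NC}(n) + H_{N1}(n)U_{N1}(n) - b_{N1}(n)U_{N2}(n) \\
F_{Ni}(n) &= Q_{L_{Ni-1}L_{Ni}}(n-1) + b_{Ni-1}(n)V_{Ni-1}(n-1) - b_{Ni}(n)V_{Ni}(n-1) \\
          &= -b_{Ni-1}(n)U_{Ni-1}(n) + H_{Ni}(n)U_{Ni}(n) - b_{Ni}(n)U_{Ni+1}(n), \\
          &\qquad i = 2, \ldots, \mathrm{NS}, \mathrm{NS}+2, \ldots, \mathrm{NTN}; \text{ if } i = \mathrm{NS}, \text{ then } U_{N\mathrm{NS}+1}(n) = U'_{N\mathrm{NS}}(n); \\
          &\qquad \text{if } i = \mathrm{NTN}, \text{ then } U_{N\mathrm{NTN}+1}(n) = U_{Nr}(n) \\
F_{sin}(n) &= Q_{L_{sin}}(n-1) + Q_{L_{N\mathrm{NS}}}(n-1) + b_{N\mathrm{NS}}(n)V_{N\mathrm{NS}}(n-1) - V_{C_{sin}}(n-1) \\
           &= -b_{N\mathrm{NS}}(n)U_{N\mathrm{NS}}(n) + H_{sf}(n)U'_{N\mathrm{NS}}(n) - b_{sin}U_{N\mathrm{NS}+1}(n) \\
F_{N\mathrm{NS}+1}(n) &= Q_{L_{N\mathrm{NS}+1}}(n-1) - Q_{L_{sin}}(n-1) + V_{C_{sin}}(n-1) - b_{N\mathrm{NS}+1}(n)V_{N\mathrm{NS}+1}(n-1) \\
           &= -b_{sin}U'_{N\mathrm{NS}}(n) + H_{sb}(n)U_{N\mathrm{NS}+1}(n) - b_{N\mathrm{NS}+1}(n)U_{N\mathrm{NS}+2}(n) \\
F_{Nr}(n) &= b_{N\mathrm{NTN}}(n)V_{N\mathrm{NTN}}(n-1) + b_{Nr}(n)V_{L_{Nr}}(n-1) + Q_{L_{N\mathrm{NTN}}}(n-1) \\
          &= -b_{N\mathrm{NTN}}(n)U_{N\mathrm{NTN}}(n) + H_{Nr}(n)U_{Nr}(n) \\
P_{Ni}(n) &= b_{Ni}(n)\big[U_{Ni}(n) - U_{Ni+1}(n) + V_{Ni}(n-1)\big], \quad i = 1, 2, \ldots, \mathrm{NTN}; \\
          &\qquad \text{if } i = \mathrm{NS}, \text{ then } U_{N\mathrm{NS}+1}(n) = U'_{N\mathrm{NS}}(n); \text{ if } i = \mathrm{NTN}, \text{ then } U_{N\mathrm{NTN}+1}(n) = U_{Nr}(n) \\
P_{sin}(n) &= b_{sin}\big[U'_{N\mathrm{NS}}(n) - U_{N\mathrm{NS}+1}(n)\big] - Q_{L_{sin}}(n-1) + V_{C_{sin}}(n-1) \\
P_{Nr}(n) &= b_{Nr}(n)\big[U_{Nr}(n) - V_{L_{Nr}}(n-1)\big]
\end{aligned}
\]

In matrix form, these nasal-tract equations relate the force vector
$[F_{N1}, \ldots, F_{N\mathrm{NS}}, F_{sin}, F_{N\mathrm{NS}+1}, \ldots, F_{N\mathrm{NTN}}, F_{Nr}]^T$ to the unknown column vector
$[P_{NC}, U_{N1}, \ldots, U_{N\mathrm{NS}}, U'_{N\mathrm{NS}}, U_{N\mathrm{NS}+1}, \ldots, U_{N\mathrm{NTN}}, U_{Nr}]^T$ through a band-diagonal
coefficient matrix whose rows are the right-hand sides above. (C.24)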
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If  there  is  no  nasal  sinus  cavity,  the  matrix  equation  (C.24)  reduces  to  the 
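following:

\[
\begin{aligned}
F_{N1}(n) &= Q_{L_{N1}}(n-1) - b_{N1}(n)V_{N1}(n-1) \\
          &= -P_{NC}(n) + H_{N1}(n)U_{N1}(n) - b_{N1}(n)U_{N2}(n) \\
F_{Ni}(n) &= Q_{L_{Ni-1}L_{Ni}}(n-1) + b_{Ni-1}(n)V_{Ni-1}(n-1) - b_{Ni}(n)V_{Ni}(n-1) \\
          &= -b_{Ni-1}(n)U_{Ni-1}(n) + H_{Ni}(n)U_{Ni}(n) - b_{Ni}(n)U_{Ni+1}(n), \\
          &\qquad i = 2, \ldots, \mathrm{NTN}; \text{ if } i = \mathrm{NTN}, \text{ then } U_{N\mathrm{NTN}+1}(n) = U_{Nr}(n) \\
F_{Nr}(n) &= b_{N\mathrm{NTN}}(n)V_{N\mathrm{NTN}}(n-1) + b_{Nr}(n)V_{L_{Nr}}(n-1) + Q_{L_{N\mathrm{NTN}}}(n-1) \\
          &= -b_{N\mathrm{NTN}}(n)U_{N\mathrm{NTN}}(n) + H_{Nr}(n)U_{Nr}(n) \\
P_{Ni}(n) &= b_{Ni}(n)\big[U_{Ni}(n) - U_{Ni+1}(n) + V_{Ni}(n-1)\big], \\
          &\qquad i = 1, \ldots, \mathrm{NTN}; \text{ if } i = \mathrm{NTN}, \text{ then } U_{N\mathrm{NTN}+1}(n) = U_{Nr}(n) \\
P_{Nr}(n) &= b_{Nr}(n)\big[U_{Nr}(n) - V_{L_{Nr}}(n-1)\big]
\end{aligned}
\tag{C.25}
\]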
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If  the  glottal  impedance  and  the  subglottal  system  are  included,  the  matrix  equation  (C.22) 
extends  to  the  following: 
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From  the  derived  discrete-time  acoustic  matrix  equations,  we  can  form  four 
different  vocal  system  model  structures,  which  are: 

[1] (C.22), (C.23), and (C.24) for the vocal system model with nasal sinus but no
glottal impedance and no subglottal system.

[2] (C.22), (C.23), and (C.25) for the vocal system model with no nasal sinus and
no glottal impedance and no subglottal system.

[3] (C.26), (C.23), and (C.24) for the vocal system model with nasal sinus and
with the glottal impedance and the subglottal system.

[4] (C.26), (C.23), and (C.25) for the vocal system model with no nasal sinus but
with the glottal impedance and the subglottal system.


There  are  three  matrix  equations  for  each  structure.  Each  matrix  equation  can  be 
written  as  y = A • x,  where  A is  a non-square  band  diagonal  sparse  matrix  of 
coefficients,  y is  a column  vector  of  force  constants,  and  x is  the  unknown  column  vector. 
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The  elements  in  the  sparse  coefficient  matrix  and  the  force  constants  inside  the  vocal 
system  are  defined  in  the  previous  paragraphs.  In  addition  to  the  three  matrix  equations, 
we need the boundary condition at the nasal coupling point,
$U_{NC}(n) = U_{i+1}(n) + U_{N1}(n)$. For cases [3] and [4], we need one more boundary
condition at the glottis, $U_g(n) = U_1(n) + U_0(n)$. It may also be noted that the three
matrix equations for each case are coupled by the term $P_{NC}(n)$ and the boundary condition(s).
As Maeda (1982a) pointed out, if we eliminate $P_{NC}(n)$ and $U_{NC}(n)$ analytically, the
formulation results in an unstable system. Therefore, $P_{NC}(n)$ and $U_{NC}(n)$ are included in
the unknown column vectors to obtain stable solutions.

There are many methods that can be used to solve these sparse linear system
equations (see Chapter 2 of Press et al. (1992)). Since the coefficient matrices are band
diagonal sparse matrices, an efficient elimination procedure followed by a substitution
procedure can be used. Once the values of $U_i(n)$ and $U_{Ni}(n)$ for all i, $P_{NC}(n)$, and $U_{NC}(n)$
are solved, the pressures $P_i(n)$ and $P_{Ni}(n)$ and the volume velocities $u_{i3}(n)$ and
$u_{Ni3}(n)$ for all i can be computed. Then, the force constants and coefficient matrices are
updated for the next recursion.
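The elimination-then-substitution idea can be illustrated on the simplest band-diagonal case. The sketch below is a standard tridiagonal (Thomas) solver applied to a square system with the same row pattern as the interior rows of (C.11), namely $-b_{i-1}$, $H_i$, $-b_i$. It is an assumption-laden illustration only: the actual ARTM systems are non-square, are coupled through $P_{NC}(n)$ and the boundary conditions, and need not have been solved this way. The function name and argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix is assumed to be diagonally dominant.
    """
    n = len(diag)
    c_mod = [0.0] * n   # modified super-diagonal
    d_mod = [0.0] * n   # modified right-hand side
    c_mod[0] = upper[0] / diag[0]
    d_mod[0] = rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        denom = diag[i] - lower[i] * c_mod[i - 1]
        c_mod[i] = upper[i] / denom if i < n - 1 else 0.0
        d_mod[i] = (rhs[i] - lower[i] * d_mod[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_mod[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d_mod[i] - c_mod[i] * x[i + 1]
    return x
```

The cost grows linearly with the number of sections, which is why a banded solver is attractive when the system must be solved once per output sample.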


APPENDIX  D 

GUIDELINE  AND  APPLIED  SENTENCE  RESULTS  OF  THE  OPTIMIZATION 

PROCEDURE 

In  Chapter  3,  we  covered  the  details  of  speech  inverse  filtering  for  the  simulated 
annealing  optimization  algorithm.  The  annealing  parameters  control  the  performance  of 
the  optimization  process.  The  various  acoustic  characteristics  of  the  speech  signal  and 
target-frame  selections  made  the  optimization  process  with  default  values  of  the  annealing 
parameters  (Table  3-2)  difficult  to  manage  for  various  cases.  Although  we  did  not  know 
the  combinations  of  the  annealing  parameters  that  perform  well  in  the  optimization 
process  for  different  target  frames,  the  following  guideline  provides  some  rules  for 
adjusting  the  appropriate  annealing  parameters. 

[1] Set the desired nasalization extent and set the number of dimensions of the
articulatory vector at the appropriate dimensions, e.g., M=8 for front vowels,
M=9 for nasalized front vowels, M=11 for middle vowels, back vowels, and
semivowels, and M=12 for nasalized vowels. See the descriptions in section
3.3.1. Start the optimization process with the default initial articulatory vector
and the default annealing parameters. If the error distance is less than 1% after
the process stops, go to step [5]. If not, go to step [2].

[2] Check if the current vocal tract shape (or cross-sectional area) is reasonable. If
the shape is not reasonable, go to step [3]. Otherwise, record the error distance
as $\varepsilon_p$ and the current final temperature as $T_p$. Then set the initial temperature
$T = \lfloor T_p \rfloor$, where $\lfloor \cdot \rfloor$ represents the floor value of the argument. Start the
optimization process again. If the new error distance is less than $\varepsilon_p$, then this
step is repeated until the error criterion is met. If not, go to step [4].

[3] Use the control button labeled Shape Recover on the Articulatory
Optimization Setup popup window to recover the vocal tract shape
(articulatory vector) to the initial settings. Several adjustments of annealing
parameters can be used. The following order of adjustments is
recommended: raise the initial temperature, increase the value of the reduction
factor, increase the total number of evaluations, and change the other
annealing parameters. Then begin the process and apply step [2].

[4] Recover the vocal tract shape as described in step [3]. Increase the number of
dimensions of the articulatory vector from M=8 to M=11 and start the process.
Apply step [2].

[5] Check the vocal tract shape with X-ray tracings or schematic vocal tract
profiles as published in the literature. If the vocal tract outline is similar to
those published, then the optimization process is done. If not, this may mean
that the “ventriloquist effect” has occurred. One can adjust the settings of the
nine articulatory sliders in the Articulatory Position Settings popup window so
that the initial configuration is closer to the true outline. Then go to step [1] to
start the optimization process again.

[6] If the above steps have been tried and the error criterion is still not satisfied,
then go back to the target-frame selection phase and reselect the current target
frame. Start the optimization process from step [1].

The above guidelines have been applied to two speech signals, one speech token
spoken by each of two male subjects, A and B. The speech token is “We were away
a year ago.” Figure D-1 and Figure D-2 show the optimized target frames for the
speech signals spoken by subject A and subject B, respectively. See Appendix A
for the applied vowel results.
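
Each panel of Figures D-1 and D-2 lists target and model formants with a JND flag and an error percentage. The sketch below shows one way such a comparison could be computed. The mean relative deviation used here and the 3% threshold are assumptions for illustration only; the error criterion and JND limits actually used by the software are defined elsewhere in this work. The function names are hypothetical, and the sample values are those legible in one panel of Figure D-1, whose printed error is 0.05414%.

```python
def formant_error_percent(target, model):
    """Mean relative deviation between target and model formants, in percent.

    Assumed definition for illustration; not necessarily the criterion used by ARTM.
    """
    if len(target) != len(model) or not target:
        raise ValueError("target and model must be non-empty and of equal length")
    return 100.0 * sum(abs(t - m) / t for t, m in zip(target, model)) / len(target)


def within_jnd(target, model, jnd_fraction=0.03):
    """Flag each formant whose deviation stays inside an assumed JND fraction."""
    return [abs(t - m) <= jnd_fraction * t for t, m in zip(target, model)]


# F1..F4 in Hz, as legible in one Figure D-1 panel.
target = (450.7, 1064.7, 1688.0, 3285.1)
model = (450.0, 1064.8, 1688.3, 3286.1)

error = formant_error_percent(target, model)   # roughly 0.053 percent
flags = within_jnd(target, model)              # one boolean per formant
```

Under this assumed definition the result is close to, but not identical with, the printed percentage, which is consistent with the panel values being rounded to one decimal place.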
Figure D-1 panels, Frames 1-6. Each panel shows a plot of the vocal tract cross-sectional area (cm^2), the optimized articulatory parameters, and the target and model formant frequencies (Hz). The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
1      -0.295214  (4.839606, 4.783364)  (2.979713, 3.554248)  0.076798  0.414483     -0.296484
2      -0.352601  (4.091919, 5.010448)  (3.027564, 4.002414)  0.085574  0.459394     -0.051749
3      -0.383820  (4.744510, 5.164305)  (3.601607, 4.679456)  0.280142  0.306329     0.262138
4      -0.331027  (4.824987, 5.223393)  (3.459779, 4.745768)  0.173639  0.180013     0.231494
5*     -0.265212  (4.173810, 4.973047)  (3.644336, 4.408756)  0.077368  0.350424     -0.030946
6*     -0.381843  (4.860657, 4.912795)  (2.980935, 3.750502)  0.079227  0.373735     -0.033129

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
1      322.4 / 322.5        656.4 / 656.7        2720.9 / 2721.1      3660.8 / 3657.7      0.03069
2      328.4 / 328.4        807.7 / 807.4        2334.1 / 2333.3      3134.3 / 3135.4      0.02787
3      328.9 / 328.9        1934.4 / 1934.2      2297.1 / 2296.7      3228.4 / 3227.2      0.01743
4      287.5 / 287.4        2006.5 / 2007.1      2347.0 / 2346.9      3296.9 / 3303.0      0.04191
5*     319.8 / 319.8        1390.2 / 1389.9      2252.8 / 2251.9      3357.3 / 3358.6      0.02651
6*     329.1 / 329.1        664.9 / 664.4        2190.2 / 2190.3      3337.3 / 3339.8      0.03616

* Frame number inferred from panel position; the panel header is not legible in the scan.

Figure D-1: The optimized target frames of the sentence “We were away a year ago,” spoken by male subject A.

Figure D-1 panels, Frames 7-12. The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
7*     -0.400354  (4.581029, 5.228440)  (2.984698, 3.880309)  0.158502  0.419453     -0.293551
8      -0.401027  (3.710760, 5.614390)  (3.044841, 4.003826)  0.159859  0.428789     -0.285939
9      -0.401009  (3.736150, 4.931845)  (2.958587, 3.740478)  0.129357  0.429982     -0.168980
10*    -0.400731  (4.331834, 4.522106)  (3.575943, 4.261568)  0.266536  0.113469     -0.286778
11     -0.396526  (5.208612, 4.633890)  (2.990479, 3.917469)  0.152407  0.383981     -0.205112
12     -0.401388  (3.700464, 5.577478)  (3.079889, 3.860703)  0.239711  0.564889     -0.298830

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
7*     383.6 / 383.8        813.6 / 813.3        1890.2 / 1889.5      3259.6 / 3264.4      0.06137
8      423.7 / 423.7        1003.4 / 1003.5      1617.6 / 1617.7      3195.1 / 3196.1      0.00955
9      376.2 / 376.2        631.4 / 631.6        2238.1 / 2238.3      3182.7 / 3298.0      0.55061
10*    500.3 / 500.1        1411.4 / 1411.6      2203.5 / 2204.7      3354.0 / 3355.4      0.03203
11     413.3 / 413.3        815.4 / 815.6        2312.4 / 2312.2      3186.7 / 3187.7      0.01406
12     450.7 / 450.0        1064.7 / 1064.8      1688.0 / 1688.3      3285.1 / 3286.1      0.05414

* Frame number inferred from panel position; the panel header is displaced in the scan.

Figure D-1: Continued.

Figure D-1 panels, Frames 13-18. The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible. A dash marks a value that is not legible in the scan.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
13     -0.320191  (4.438051, 4.767700)  (3.474534, 4.529708)  0.330985  0.195973     -0.247216
14     -0.296191  (5.070441, 4.866640)  (3.302177, 4.607279)  0.408096  0.178389     -0.065855
15     -0.263461  (4.864736, 5.085028)  (3.253271, 4.595029)  0.274938  0.105571     0.127595
16     -0.310895  (4.703774, 5.410002)  (3.345196, 4.732948)  0.498416  0.038157     0.202666
17     -0.316101  (4.556979, 5.360120)  (3.480864, 4.809723)  0.575142  0.045386     0.295078
18     -0.381602  (5.112584, 4.829454)  (3.391806, 4.715267)  0.277634  0.167777     0.197406

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
13     450.0 / 450.0        1713.9 / 1713.5      2339.7 / 2340.4      3479.3 / 3479.2      0.01556
14     370.7 / 371.0        1887.4 / -           2539.3 / 2539.7      3577.7 / 3576.0      -
15     346.1 / 346.2        1873.4 / 1874.3      2511.8 / 2512.0      3565.4 / 3564.1      0.03051
16     275.2 / 275.3        2086.6 / 2086.5      2871.7 / 2871.6      3699.3 / 3699.2      0.00471
17     274.5 / 274.5        2128.8 / 2128.8      3281.7 / 3281.4      3612.5 / 3613.4      0.00951
18     317.3 / 317.3        2001.0 / 2000.6      2336.6 / 2336.8      3343.2 / 3343.5      0.01158

Figure D-1: Continued.

Figure D-1 panels, Frames 19-24. The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
19     -0.390341  (4.500666, 4.943980)  (3.747747, 4.690444)  0.189774  0.251803     -0.241062
20     -0.401420  (3.700176, 5.597184)  (3.293831, 3.886328)  0.297621  0.647351     -0.299500
21     -0.401201  (3.700115, 5.443300)  (3.376132, 3.720156)  0.321179  0.609505     -0.298412
22     -0.379224  (4.562151, 5.107852)  (3.487845, 4.818657)  0.147268  0.352104     -0.299674
23     0.297112   (4.524280, 4.859411)  (3.563238, 4.394766)  0.125476  0.343277     -0.284596
24     -0.400097  (3.708182, 5.350190)  (3.309445, 3.573624)  0.253914  0.275782     -0.295448

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
19     365.1 / 365.2        1562.8 / 1562.7      1809.9 / 1890.7      3371.3 / 3370.0      0.02512
20     445.8 / 445.4        1277.3 / 1276.9      1710.9 / 1722.5      3261.8 / 3261.0      0.21271
21     478.6 / 478.6        1323.2 / 1323.0      1910.2 / 1910.0      3311.2 / 3311.1      0.00853
22     300.5 / 300.5        1295.4 / 1295.5      1827.1 / 1927.2      3807.5 / 3565.1      0.95769
23     387.7 / 387.9        1373.2 / 1373.2      2176.9 / 2177.1      3363.7 / 3364.0      0.01419
24     480.2 / 480.2        1238.7 / 1238.7      2149.4 / 2149.1      3365.5 / 3365.2      0.00563

Figure D-1: Continued.

Figure D-1 panels, Frames 25-26. The velum position is (2.045000, 5.090000) in both frames, and the JND column reads "Y" for every formant.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
25     -0.400450  (3.701344, 5.207943)  (3.157058, 3.640734)  0.296594  0.497078     -0.299717
26     -0.400763  (3.858089, 4.994038)  (3.056451, 3.561519)  0.294234  0.547207     0.012368

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
25     518.4 / 518.4        1126.7 / 1129.3      2196.0 / 2196.2      3398.9 / 3401.4      0.08341
26     504.5 / 504.3        987.9 / 907.8        2329.8 / 2326.9      3416.9 / 3417.4      0.04494

Figure D-2 panels, Frames 1-6. Each panel shows a plot of the vocal tract cross-sectional area (cm^2), the optimized articulatory parameters, and the target and model formant frequencies (Hz). The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
1      -0.335120  (3.703419, 5.311586)  (2.992716, 3.823177)  0.070736  0.451054     0.247373
2      -0.373695  (4.811612, 5.183755)  (3.576164, 4.719492)  0.212940  0.197551     0.094424
3      -0.289567  (4.989875, 5.056913)  (3.659096, 4.719562)  0.153200  0.257071     0.195418
4      -0.397331  (5.414547, 5.054503)  (3.069459, 3.678089)  0.063201  0.306152     0.036285
5      -0.398308  (4.012872, 5.636487)  (3.060456, 3.686242)  0.142035  0.144014     0.019017
6      -0.400235  (3.712281, 5.706581)  (3.106781, 3.889300)  0.153682  0.386648     -0.299157

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
1      308.2 / 308.2        725.6 / 725.6        1973.4 / 1973.3      3134.4 / 3131.1      0.02402
2      311.0 / 310.9        1916.5 / 1917.2      2256.7 / 2259.0      3282.3 / 3283.0      0.02022
3      297.4 / 297.4        1894.2 / 1894.4      2338.4 / 2338.7      3278.2 / 3272.6      0.03528
4      309.5 / 309.5        852.5 / 852.5        1789.8 / 1790.3      3300.0 / 3299.2      0.01403
5      388.8 / 388.7        1011.7 / 1011.6      1574.5 / 1573.6      3314.2 / 3316.9      0.03175
6      397.0 / 397.0        1084.3 / 1083.8      1440.6 / 1441.2      3330.0 / 3332.6      0.03570

Figure D-2: The optimized target frames of the sentence “We were away a year ago,” spoken by male subject B.

Figure D-2 panels, Frames 7-12. The panel headers are displaced in the scan, so the frame numbers below are inferred from panel position. The velum position is (2.045000, 5.090000) in every frame.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
7      -0.401159  (3.701498, 5.597988)  (3.202925, 3.875268)  0.204319  0.409777     -0.297142
8      -0.397946  (4.259106, 5.130637)  (2.977700, 3.916171)  0.130780  0.588537     0.020919
9      -0.392355  (4.247635, 4.623539)  (3.442429, 4.342550)  0.164341  0.264943     -0.276863
10     -0.389529  (3.966737, 5.098867)  (3.608622, 4.517625)  0.335497  0.135433     -0.223796
11     -0.288207  (4.865392, 4.844986)  (3.478848, 4.624095)  0.306237  0.268265     -0.011991
12     -0.309652  (4.659431, 4.892226)  (3.386737, 4.583634)  0.219131  0.129724     0.077292

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
7      452.7 / 448.6        1207.1 / 1207.1      1706.7 / 1713.4      3281.2 / 3289.4      0.41168
8      329.8 / 329.9        680.4 / 680.4        1965.9 / 1965.7      3114.7 / 3119.6      0.03730
9      453.6 / 453.4        1239.5 / 1239.4      2142.4 / 2143.0      3341.7 / 3341.3      0.02390
10     464.0 / 464.0        1704.9 / 1705.3      2267.2 / 2267.2      3423.2 / 3423.7      0.01066
11     376.7 / 376.7        1856.3 / 1856.0      2417.9 / 2418.2      3423.7 / 3424.4      0.01196
12     388.4 / 388.2        1798.2 / 1798.3      2345.4 / 2345.4      3396.4 / 3395.8      0.01656

Figure D-2: Continued.

Figure D-2 panel, Frame 13. The velum position is (2.045000, 5.090000). The formant values of this panel are not legible in the scan.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
13     -0.288301  (4.823072, 5.259611)  (3.461715, 4.731454)  0.232324  0.340106     0.148036

Figure D-2: Continued.

Figure D-2 panels, Frames 19-22. The panel headers are displaced in the scan, so the frame numbers below are inferred from panel position. The velum position is (2.045000, 5.090000) in every frame, and the JND column reads "Y" wherever it is legible.

Frame  Jaw angle  Tongue tip            Tongue body           Lip open  Lip protru.  Hyoid
19     -0.339906  (4.557400, 4.988651)  (3.649744, 4.632542)  0.213379  0.367967     -0.286820
20     -0.385660  (4.214625, 4.894730)  (3.769741, 4.356261)  0.220392  0.347459     -0.279147
21     -0.401105  (4.589791, 4.599903)  (3.504574, 4.374523)  0.243942  0.425875     -0.265436
22     -0.401240  (4.822456, 5.040797)  (3.387999, 4.662027)  0.088662  0.495322     -0.287198

Frame  F1 (target / model)  F2 (target / model)  F3 (target / model)  F4 (target / model)  Error (%)
19     402.5 / 402.5        1608.0 / 1608.0      2006.4 / 2008.4      3364.3 / 3364.2      0.02850
20     470.4 / 470.5        1484.7 / 1484.7      2110.8 / 2110.6      3348.5 / 3348.3      0.00956
21     461.8 / 461.8        1383.6 / 1383.7      2106.4 / 2106.4      3263.8 / 3263.9      0.00440
22     319.4 / 319.5        915.3 / 915.3        1878.8 / 1878.8      3156.5 / 3160.9      0.03791

Figure D-2: Continued.